labibi Documentation
Release 1.0

C. Titus Brown

November 14, 2015
Contents

1 Welcome
   1.1 Learning goals
   1.2 Safe space and code of conduct
   1.3 Instructor introductions
   1.4 Amazon and cloud computing - why?
   1.5 Sticky notes and how they work + Minute Cards
   1.6 Warning
2 Non-model organisms and RNAseq
   2.1 The overall process
3 Getting started with Amazon EC2
   3.1 More details
4 Short read quality and trimming
   4.1 Prepping the computer
   4.2 Data source
   4.3 Copying in some data to work with
   4.4 Copying data into a working location
   4.5 FastQC
   4.6 Trimmomatic
   4.7 FastQC again
   4.8 Trim the rest of the sequences
   4.9 Interleave the sequences
5 Running digital normalization
6 Running the actual assembly
7 BLASTing your assembled data
8 Annotation of de novo transcriptome
   8.1 Identify the Gene/Transcript relationships
   8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
   8.3 Capturing BLAST Homologies
   8.4 Characterization of functional annotation features
   8.5 Integration of all annotations into one database
   8.6 Output an Annotation Report
9 Quantification and Differential Expression
   9.1 Download Express
   9.2 Align Reads with Bowtie
   9.3 Quantify Expression using eXpress
   9.4 Differential Expression
10 Remapping your reads to your assembled transcriptome
11 Miscellaneous advice
   11.1 Sequencing depth and number of samples
   11.2 Downloading your data
   11.3 Developing your own pipeline
12 More resources
   12.1 Informational resources
   12.2 Places to share data, scripts, and results files
13 Miscellaneous questions
14 Tips and Tricks for working with Remote Computers
   14.1 Use screen to run things that take a long time
   14.2 Use CyberDuck to transfer files
   14.3 Subsetting data
   14.4 Running full analyses on Amazon Web Services
15 Technical information
This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camille Scott, and Tamer Mansour. See the workshop organization page for more information, or contact Titus directly.
Tutorials
CHAPTER 1
Welcome
1.1 Learning goals
For you:

* get a first (or second) look at tools
* gain some experience in the basic command line
* get 80% of the way to a complete analysis of some data
For us:

* what are the on-campus needs? who are the on-campus people?
1.2 Safe space and code of conduct
This is intended to be a safe and friendly place for learning!

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others!
1.3 Instructor introductions
Titus Brown - prof here at UC Davis in the School of Vet Med.

Tamer Mansour - MD/PhD, postdoc with Titus.

Camille Scott - CS graduate student here at Davis.
1.4 Amazon and cloud computing - why?
* simplifies software installation
* can be used for bigger analyses quite easily
* good for "burst" capacity (just got a data set!)
* accessible everywhere
* they give us $100 gift certificates
1.5 Sticky notes and how they work + Minute Cards
Basic rules:

* no sticky note - "working on it"
* green sticky note - "all is well"
* red sticky note - "need help!"

Place the sticky notes where we can see them from the back of the room -- e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.
1.6 Warning
The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.
---
Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch -- so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" -- in the sense of actually being present -- but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process
[Figure: the non-model RNAseq pipeline (_static/nonmodel-rnaseq-pipeline.png)]
* Copy over your RNAseq data (from two or more samples).
* Trim primers and junk from sequence (Short read quality and trimming).
* Do abundance normalization (Running digital normalization).
* Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point you can do one or more of the following:

* Annotate your transcripts (Annotation of de novo transcriptome).
* Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
* BLAST your transcripts individually (BLASTing your assembled data).
Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:

* go to http://aws.amazon.com/, log in, then select "EC2" (upper left)
* select "Launch instance"
* select "Ubuntu 14.04" from the list
* select "m3.xlarge" from the list (towards bottom of "General purpose")
* click "Review and launch"
* select "Launch"
* if your first time through, create a key pair; otherwise select an existing one
* click "launch instance"
3.1 More details
3.1.1 Start up an EC2 instance
Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select the 'My Account/Console' menu option 'AWS Management Console.'

Log in with your username & password.

Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings!) and click on Launch.
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file

(You only need to do this once for each key!)

Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications... Utilities...) and type:

    chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
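If you're curious what that chmod did, here is a small self-contained demonstration on a throwaway file (the file name here is invented; don't run this on your real key):

```shell
# make a throwaway stand-in for the key file
touch demo.pem

# og-rwx: remove read, write, and execute from group (g) and others (o)
chmod og-rwx demo.pem

# the last six characters of the mode column should now all be dashes
ls -l demo.pem
```

Only the owner's permission bits survive, which is what ssh requires before it will agree to use a private key file.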
Then, type:

    ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

* EC2 instance is running
* used ubuntu 14.04
* NOT a micro instance (m3.xlarge or bigger)
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder, if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

    sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:

    sudo apt-get update
    sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

    mkdir /mnt/data
Next, let's grab part of the data set:
    cd /mnt/data
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now if you type:

    ls -l

you should see something like:

    -r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
    -r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

    chmod u-w *
We'll talk about what these files are, below.
4.4 Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

    mkdir /mnt/work
    cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

    ln -fs /mnt/data/* .
These are FASTQ files -- let's take a look at them:

    less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

* why do the files have DNA in the name?
* why are there R1 and R2 in the file names?
* why don't we combine all the files?
Links:

* FASTQ Format
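To make the format concrete: each FASTQ record is four lines (an '@' header, the sequence, a '+' separator, and per-base quality scores). Here is a self-contained sketch using an invented read, which also shows why "number of lines divided by four" counts reads:

```shell
# one toy FASTQ record, gzipped like the workshop files (the read is made up)
printf "@read1\nACGTACGT\n+\nIIIIIIII\n" | gzip > toy.fastq.gz

# look at it the same way as above
zcat toy.fastq.gz

# count reads: total lines / 4
echo $(( $(zcat toy.fastq.gz | wc -l) / 4 ))    # prints 1
```

The same counting trick works on the real .extract.fastq.gz files linked into /mnt/work.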
4.5 FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
    fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
    fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':

    ls -d *fastqc*

to list the files, and you should see:

    0Hour_ATCACG_L002_R1_001.extract_fastqc
    0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
    0Hour_ATCACG_L002_R2_001.extract_fastqc
    0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that! We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

* 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

* What should you pay attention to in the FastQC report?
* Which is "better", R1 or R2? And why?

Links:

* FastQC
* FastQC tutorial video
4.6 Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:

    TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
                  0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
         0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
         0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 \
         SLIDINGWINDOW:4:2 \
         MINLEN:25

You should see output that looks like this:

    Quality encoding detected as phred33
    Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
    TrimmomaticPE: Completed successfully
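As a quick arithmetic check (an extra step, not part of the original protocol), you can recompute the "Both Surviving" percentage from the two counts in that summary line:

```shell
# 138775 surviving pairs out of 140557 input pairs, as reported above
awk 'BEGIN { printf "%.2f\n", 100 * 138775 / 140557 }'    # prints 98.73
```

The same arithmetic applies to the "Forward Only", "Reverse Only", and "Dropped" figures, which is a cheap way to confirm you are reading the report correctly.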
Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:

* Trimmomatic
4.7 FastQC again

Run FastQC again on the trimmed files:

    fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
    fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:

* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:

    less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use the spacebar to scroll, and 'q' to exit less)
Questions:

* is the quality-trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 Trim the rest of the sequences

First, download the rest of the data:
    cd /mnt/data
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

    cd /mnt/work
    ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
    rm -f orphans.fq

    for filename in *_R1_*.extract.fastq.gz
    do
         # first, make the base by removing .extract.fastq.gz
         base=$(basename $filename .extract.fastq.gz)
         echo $base

         # now, construct the R2 filename by replacing R1 with R2
         baseR2=${base/_R1_/_R2_}
         echo $baseR2

         # finally, run Trimmomatic
         TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
             ${base}.qc.fq.gz s1_se \
             ${baseR2}.qc.fq.gz s2_se \
             ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
             LEADING:2 TRAILING:2 \
             SLIDINGWINDOW:4:2 \
             MINLEN:25

         # save the orphans
         cat s1_se s2_se >> orphans.fq
    done
Things to mention --

* # starts a comment
* anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
* what's with 'orphans.fq'??
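The two filename tricks used in the loop, basename to strip a suffix and bash's ${variable/pattern/replacement} substitution, can be tried on their own:

```shell
filename="0Hour_ATCACG_L002_R1_001.extract.fastq.gz"

# basename strips any directory part and a trailing suffix
base=$(basename $filename .extract.fastq.gz)
echo $base      # prints 0Hour_ATCACG_L002_R1_001

# ${var/pattern/replacement} substitutes the first match (a bash feature)
baseR2=${base/_R1_/_R2_}
echo $baseR2    # prints 0Hour_ATCACG_L002_R2_001
```

Running these two commands by hand is also a good way to convince yourself the loop will name its output files the way you expect before you let it run on everything.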
Questions:

* how do you figure out if it's working?
   - copy/paste it from Word
   - put in lots of echo
   - edit one line at a time
* how on earth do you figure out how to do this??
4.9 Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

    sudo pip install -U setuptools
    sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

    for filename in *_R1_*.qc.fq.gz
    do
         # first, make the base by removing .qc.fq.gz
         base=$(basename $filename .qc.fq.gz)
         echo $base

         # now, construct the R2 filename by replacing R1 with R2
         baseR2=${base/_R1_/_R2_}
         echo $baseR2

         # construct the output filename
         output=${base/_R1_/}.pe.qc.fq.gz

         interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
    done

    gzip orphans.fq
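"Interleaved" simply means that R1 and R2 records alternate within a single file. Here is a toy illustration with one invented read pair, using a generic paste/tr pipeline rather than khmer's interleave-reads.py:

```shell
# two toy paired records (the reads are made up)
printf "@read1/1\nACGT\n+\nIIII\n" > toy_R1.fq
printf "@read1/2\nTGCA\n+\nIIII\n" > toy_R2.fq

# collapse each 4-line record onto one line, pair them up side by side,
# then expand again: the result is each R1 record followed by its R2 mate
paste <(paste - - - - < toy_R1.fq) <(paste - - - - < toy_R2.fq) | tr '\t' '\n'
```

The khmer script does the same pairing (plus sanity checks on read names), which is why the downstream tools can recover the pairs from a single file.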
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

    cd /mnt/work
    normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):

    normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

    filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

    for filename in *.pe.*.keep.abundfilt
    do
         extract-paired-reads.py $filename
    done
Put all the orphaned reads in one place:

    cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:

    for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
    do
         base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
         output=${base}.dn.fq.gz
         gzip -c $filename > $output
    done
Now, if you type:

    ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

    0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
    0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
    0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
    0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
    0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
    6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:

    rm *.pe *.se *.abundfilt *.keep
    rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

    sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

    cd
    curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
    tar xzf trinity.tar.gz
    mv trinityrnaseq* trinity

    cd trinity
    make
Go into the work directory, and prepare the data:

    cd /mnt/work
    for i in *.dn.fq.gz
    do
         split-paired-reads.py $i
    done

    cat *.1 > left.fq
    cat *.2 > right.fq
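A quick sanity check you can add here (it is not part of the original protocol): left.fq and right.fq should contain exactly the same number of reads, since every read needs a mate. Sketched on toy files so it is runnable anywhere; on your instance, swap in the real left.fq/right.fq:

```shell
# toy stand-ins for left.fq and right.fq (one invented read each)
printf "@a/1\nAC\n+\nII\n" > toy_left.fq
printf "@a/2\nGT\n+\nII\n" > toy_right.fq

# a FASTQ read is 4 lines, so reads = lines / 4
left=$(( $(wc -l < toy_left.fq) / 4 ))
right=$(( $(wc -l < toy_right.fq) / 4 ))
test "$left" -eq "$right" && echo "OK: $left read(s) per file"
```

If the counts ever differ, something went wrong in the split step and Trinity will complain later, so it is cheaper to catch it here.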
Now, run the Trinity assembler:

    ~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

    curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
    python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

    gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
    curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
    chmod u+x rename-with-partitions.py
    ./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

    sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

    sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

    cd
    git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
    sudo ./blastkit/configure-lighttpd.sh

    cd blastkit/www
    sudo ln -fs $PWD /var/www/blastkit

    mkdir files
    chmod a+rxwt files
    chmod +x /home/ubuntu

    cd /home/ubuntu/blastkit
    python ./check.py
Now, copy in your newly created transcriptome:

    cd /mnt/work
    gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

    cd /home/ubuntu/blastkit
    formatdb -i db/db.fa -o T -p F
    python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

    cd /mnt/work
    ~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

    less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

    cd
    sudo cpan URI::Escape

Note: type yes for all interactive questions.

    curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
    tar xzf transdecoder.tar.gz
    mv TransDecoder* TransDecoder
    cd TransDecoder
    make
Now we can run TransDecoder to identify the longest-ORF peptides:

    cd /mnt/work
    ~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

    less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

    sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
    mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
    gunzip uniprot_sprot.trinotate.pep.gz
    makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.

(a) search Trinity transcripts:

    blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

    blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
    mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
    gunzip uniprot_uniref90.trinotate.pep.gz
    makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

    blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
    blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

    wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
    wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:

    sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
    gunzip Pfam-A.hmm.gz
    hmmpress Pfam-A.hmm
    hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features like:

* signal peptides, using signalP
* transmembrane regions, using tmHMM
* rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

    cd
    curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
    tar xzf trinotate.tar.gz
    mv Trinotate* Trinotate
Install sqlite:

    sudo apt-get install sqlite3

We also need the DBI perl package:

    sudo cpan DBI
    sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
    gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

    ~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:

    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

    ~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.
--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

    scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

    cd /mnt/data
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in
cd mntworkln -fs mntdatafastqgz
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
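If you just want the most highly expressed transcripts, you can rank this table from the shell. The sketch below fabricates a miniature stand-in for results.xprs (file name mini-results.xprs and its two rows are invented, with values abridged from the output above) so the pipeline can be tried anywhere; in practice, point the awk command at the real results.xprs.

```shell
# Fabricate a tiny results.xprs stand-in: the same 15 tab-separated
# columns as the real eXpress output, but only two data rows.
{
    printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\t'
    printf 'est_counts\teff_counts\tambig_distr_alpha\tambig_distr_beta\t'
    printf 'fpkm\tfpkm_conf_low\tfpkm_conf_high\tsolvable\ttpm\n'
    printf '2\tnemaid1tr1\t811\t508.1\t1301\t45\t158.3\t252.7\t47.8\t481.6\t3074.0\t2311.1\t3836.9\tT\t4695.5\n'
    printf '3\tnemaid62tr13\t2150\t1846.7\t9921\t9825\t9919.9\t11549.4\t1704.9\t19.7\t52993.2\t52810.4\t53176.0\tT\t80946.1\n'
} > mini-results.xprs

# target_id is column 2 and tpm is column 15 (per the header above);
# skip the header line, then sort by TPM, highest first.
awk -F'\t' 'NR > 1 { print $2, $15 }' mini-results.xprs | sort -k2,2gr | head -5
```

The `-g` sort flag does general-numeric comparison, which handles scientific notation like 8.094611e+04 in the real file.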
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one: diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
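For a quick look at the most significant genes without opening Excel, you can sort the CSV by its FDR column from the shell. This is a sketch on a fabricated stand-in file (mini-edgeR.csv, with an assumed four-column layout; adjust the column number to match the real file's header):

```shell
# Fabricated stand-in for the edgeR output CSV: gene, logFC, P-value, FDR.
printf 'gene,logFC,PValue,FDR\n'   >  mini-edgeR.csv
printf 'nemaidA,2.1,1e-8,1e-6\n'   >> mini-edgeR.csv
printf 'nemaidB,-0.3,0.04,0.12\n'  >> mini-edgeR.csv
printf 'nemaidC,1.4,1e-4,2e-3\n'   >> mini-edgeR.csv

# Skip the header, then sort data rows by the FDR column (4th),
# smallest (most significant) first.
tail -n +2 mini-edgeR.csv | sort -t, -k4,4g | head -5
```

Again, `sort -g` is used so that scientific-notation P-values compare numerically.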
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.
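In other words, the URI follows the pattern ftp://<username>:<password>@<hostname>/. A quick sketch of composing it from the three pieces in the e-mail (the shell variable names are just for illustration):

```shell
# Slot the three pieces from the core facility's e-mail into one URI.
USERNAME=rnaseqmodel
PASSWORD=QecheJa6
HOST=titan.bch.msu.edu

echo "ftp://${USERNAME}:${PASSWORD}@${HOST}/"
# prints: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/
```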
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
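Putting those tips together, the eventual script might start life as a skeleton like this (the step names, log files, and pipeline-test directory are placeholders, not the real pipeline; swap in your actual trim/normalize/assemble commands once they work interactively):

```shell
#!/bin/bash
# Skeleton pipeline script: each echo stands in for a real command,
# so the structure can be tested end-to-end on no data at all.
set -e   # stop at the first failing step
set -u   # treat unset variables as errors

WORKDIR=pipeline-test
mkdir -p "$WORKDIR"

echo "[1/3] trimming reads"        | tee "$WORKDIR/01-trim.log"
echo "[2/3] digital normalization" | tee "$WORKDIR/02-norm.log"
echo "[3/3] assembly"              | tee "$WORKDIR/03-assembly.log"

ls "$WORKDIR"
```

The `set -e` line is the important habit: a multi-day pipeline should die loudly at the first broken step rather than burn compute on garbage input.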
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
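One caveat: for paired-end data, subset R1 and R2 with the same head count so the pairs stay matched, and always use a multiple of 4 lines, since each FASTQ record is exactly four lines. A self-contained sketch using a fabricated file (toy_R1.fastq.gz is invented for the demo, not part of the workshop data):

```shell
# Fabricate a tiny gzipped FASTQ with three 4-line records.
for i in 1 2 3; do
    printf '@read%s/1\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip > toy_R1.fastq.gz

# Keep the first two records: 2 records x 4 lines = head -8.
gunzip -c toy_R1.fastq.gz | head -8 | gzip > toy_R1.sub.fq.gz

# Confirm exactly 8 lines survived.
gunzip -c toy_R1.sub.fq.gz | wc -l
```

Running the same `head -8` on the matching R2 file would keep the same two read pairs in both files.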
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
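If converting 1K-blocks in your head is a chore, df's -h flag prints human-readable sizes instead; here it is applied to the current directory's filesystem (substitute /mnt on your Amazon machine):

```shell
# Same information as 'df -k', but with sizes like "38G" instead of
# raw 1K-block counts.
df -h .
```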
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
labibi Documentation, Release 1.0

This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camille Scott, and Tamer Mansour. See the workshop organization page for more information, or contact Titus directly.
Tutorials
CHAPTER 1
Welcome
1.1 1. Learning goals

For you:

• get a first (or second) look at tools;
• gain some experience in the basic command line;
• get 80% of the way to a complete analysis of some data.

For us:

• what are the on-campus needs? who are the on-campus people?
1.2 2. Safe space and code of conduct

This is intended to be a safe and friendly place for learning.

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others.
1.3 3. Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med.

Tamer Mansour - MD/PhD, postdoc with Titus.

Camille Scott - CS graduate student here at Davis.
1.4 4. Amazon and cloud computing - why?

• simplifies software installation;
• can be used for bigger analyses quite easily;
• good for "burst" capacity (just got a data set);
• accessible everywhere;
• they give us $100 gift certificates.
1.5 5. Sticky notes and how they work + Minute Cards

Basic rules:

• no sticky note - "working on it"
• green sticky note - "all is well"
• red sticky note - "need help"

Place the sticky notes where we can see them from the back of the room - e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.
1.6 6. Warning

The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.

---

Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch - so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" - in the sense of actually being present - but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process

[Figure: the non-model RNAseq pipeline (_static/nonmodel-rnaseq-pipeline.png)]
• Copy over your RNAseq data (from two or more samples).
• Trim primers and junk from sequence (Short read quality and trimming).
• Do abundance normalization (Running digital normalization).
• Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point you can do one or more of the following:

• Annotate your transcripts (Annotation of denovo transcriptome).
• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
• BLAST your transcripts individually (BLASTing your assembled data).

Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:

• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise, select existing;
• click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select 'My Account/Console' menu option 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen, and select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications > Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note: you have to replace the stuff after the '@' sign with the name of your host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running;
• used ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder - if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
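If you're curious what `chmod u-w` actually did, try it on a throwaway file (demo.txt is just a scratch name for the demo):

```shell
touch demo.txt        # starts out writeable by you
chmod u-w demo.txt    # drop your own write permission
ls -l demo.txt        # the mode now starts with -r--
```

This is why accidentally running a command that tries to overwrite your raw data will now fail with "Permission denied" instead of silently destroying it.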
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in:

ln -fs /mnt/data/* .

These are FASTQ files - let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

• Why do the files have DNA in the name?
• Why are there R1 and R2 in the file names?
• Why don't we combine all the files?

Links:

• FASTQ Format
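As the FASTQ Format link explains, each record in these files is exactly four lines. Here is a synthetic record (not taken from the real data), with each line's role noted in the comments:

```shell
# Line 1: '@' header with the read name
# Line 2: the sequence
# Line 3: '+' separator (may optionally repeat the read name)
# Line 4: quality scores, one character per base in line 2
printf '@read1\nACGTACGT\n+\nIIIIIIII\n'
```

On the real files, `gunzip -c 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -4` would show you the first record in the same four-line shape.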
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
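The survival percentages in that report are just each category's count divided by the input pair count; a quick awk sketch reproducing the "Both Surviving" figure from the counts above:

```shell
# counts taken from the Trimmomatic report above:
# 140557 input pairs, 138775 pairs with both reads surviving
echo '140557 138775' | awk '{ printf "%.2f%%\n", 100 * $2 / $1 }'
# prints 98.73%
```

Recomputing a percentage like this is a cheap way to check that you are reading a tool's summary line correctly.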
Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:

* Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:

* is the quality trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files - and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line - the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention -

* '#' marks a comment
* anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
* what's with 'orphans.fq'?
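The `${base/_R1_/_R2_}` construct is bash's pattern substitution, and `basename FILE SUFFIX` strips a suffix; a standalone sketch using one of the filenames above:

```shell
#!/bin/bash
# ${variable/pattern/replacement} is bash's pattern substitution.
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# strip the .extract.fastq.gz suffix
base=$(basename $filename .extract.fastq.gz)
echo $base      # prints 0Hour_ATCACG_L002_R1_001

# swap _R1_ for _R2_ to get the mate pair's name
baseR2=${base/_R1_/_R2_}
echo $baseR2    # prints 0Hour_ATCACG_L002_R2_001
```

Note this substitution syntax is a bash feature; in a plain POSIX sh you would use sed or similar instead.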
Questions:

* how do you figure out if it's working?
  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time
* how on earth do you figure out how to do this?
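One concrete way to answer "how do you figure out if it's working?" is a dry run: prefix the real command with echo so the loop only prints what it would do. A sketch (the sample filenames here are made up):

```shell
#!/bin/bash
# create two empty stand-in files (hypothetical names, just for the demo)
touch sampleA_R1_001.extract.fastq.gz sampleB_R1_001.extract.fastq.gz

for filename in *_R1_*.extract.fastq.gz
do
    base=$(basename $filename .extract.fastq.gz)
    # the leading 'echo' turns this into a dry run;
    # remove it once the printed commands look right
    echo TrimmomaticPE ${base}.extract.fastq.gz ${base/_R1_/_R2_}.extract.fastq.gz
done
```

Once the printed commands match what you would have typed by hand, delete the echo and run the loop for real.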
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
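A quick sanity check on interleaving, sketched with toy two-record files (the read names are hypothetical): an interleaved file should contain all the R1 records plus all the R2 records, alternating. The coreutils-only `paste` trick below is just a stand-in to illustrate what interleave-reads.py produces:

```shell
#!/bin/bash
# toy paired files, two 4-line records each
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > mini_R1.fq
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nAATT\n+\nIIII\n' > mini_R2.fq

# fold each 4-line record onto one line, alternate the two files, unfold
paste -d '\t' <(paste -d '\t' - - - - < mini_R1.fq) \
              <(paste -d '\t' - - - - < mini_R2.fq) | tr '\t' '\n' > mini.pe.fq

# the interleaved file must contain every record from both inputs: 8 + 8 lines
wc -l < mini.pe.fq
```

On real output, comparing line counts (interleaved = R1 + R2) is a cheap way to confirm nothing was dropped.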
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data - known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) -

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step -

0Hour_ATCACG_L002_001.dn.fq.gz   6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz   6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz   6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz   6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
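The gene_trans_map file itself is just two tab-separated columns - gene, then transcript - so standard UNIX tools can summarize it. A sketch with a toy map (the IDs below are hypothetical, simplified from Trinity's naming):

```shell
# toy gene/transcript map: gene TAB transcript, one pair per line
printf 'c0_g1\tc0_g1_i1\nc0_g1\tc0_g1_i2\nc1_g1\tc1_g1_i1\n' > toy.gene_trans_map

# count how many isoforms (i's) each gene (g) has
cut -f 1 toy.gene_trans_map | sort | uniq -c
# c0_g1 has two isoforms, c1_g1 has one
```

Run the same `cut | sort | uniq -c` on the real Trinity.fasta.gene_trans_map to see the isoform-per-gene distribution of your assembly.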
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install Transdecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the Transdecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the Transdecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

* signal peptides, using signalP
* transmembrane regions, using tmHMM
* rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
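If csvtool is not available, the same two-column, header-stripped extraction can be sketched with awk. The toy file below stands in for results.xprs (the real file has more columns; adjust the column numbers to match target_id and est_counts):

```shell
# toy tab-separated stand-in for results.xprs, with made-up values
printf 'target_id\test_counts\nnema.id1.tr1\t158.34\nnema.id2.tr1\t1218.93\n' > toy.xprs

# print both columns, skipping the header row (NR > 1)
awk -F '\t' 'NR > 1 { print $1 "\t" $2 }' toy.xprs > toy_counts.txt
cat toy_counts.txt
```

This produces the same shape of output as the csvtool pipeline above: one transcript per line, name then count, no header.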
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

* False Discovery Rate
* Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
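The general scheme is ftp://username:password@hostname/path; a small sketch assembling it from the example credentials above:

```shell
# build a credentialed FTP URL from its pieces (example values from above)
username=rnaseqmodel
password=QecheJa6
hostname=titan.bch.msu.edu

echo "ftp://${username}:${password}@${hostname}/"
# prints ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/
```

The same user:pass@host form works in wget and curl, which is what the wget commands below rely on.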
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?
labibi Documentation, Release 1.0
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
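The arithmetic works because a FASTQ record is always four lines. Here is a self-contained version of the same trick on a tiny synthetic file (the filenames are made up), so you can check the line/record math yourself:

```shell
# Make a synthetic FASTQ file with 10 records (40 lines), gzipped.
for i in $(seq 1 10)
do
    printf "@read%d\nACGTACGT\n+\nIIIIIIII\n" $i
done | gzip > synthetic.fq.gz

# Take the first 8 lines (= 2 records) and recompress, as above.
gunzip -c synthetic.fq.gz | head -8 | gzip > subset.fq.gz

gunzip -c subset.fq.gz | wc -l                 # 8 lines
gunzip -c subset.fq.gz | grep -c '^@read'      # 2 records
```

Because head always takes a multiple of 4 lines here, the subset contains only complete records.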
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) --
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
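That rule of thumb is easy to apply with shell arithmetic (the read count here is just an example; round up to be safe):

```shell
# ~1 TB of intermediate disk per 200 million reads, rounded up.
reads=500000000     # e.g. 500 million reads across all samples
tb=$(( (reads + 199999999) / 200000000 ))
echo "budget roughly ${tb} TB of intermediate disk"
```

For 500 million reads this budgets 3 TB; the "+ 199999999" makes the integer division round up instead of down.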
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt
You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
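The numbers are reported in 1-KB blocks, so dividing by 1024^2 converts them to GB. A quick sanity check on the output above (saved to a file here so the snippet is self-contained; normally you would pipe 'df -k /mnt' straight into awk):

```shell
# Save the example df output shown above.
cat > df.out <<'EOF'
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt
EOF

# Convert the 1K-block totals to GB.
awk 'NR==2 { printf "total %.1f GB, available %.1f GB\n", $2/1048576, $4/1048576 }' df.out
```

This prints a total of roughly 37 GB with about 16 GB still available -- a reminder that "40 GB" is nowhere near enough for a full mRNAseq run.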
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camille Scott, and Tamer Mansour. See the workshop organization page for more information, or contact Titus directly.
Tutorials
CHAPTER 1
Welcome
1.1 1. Learning goals
For you:
- get a first (or second) look at tools;
- gain some experience in the basic command line;
- get 80% of the way to a complete analysis of some data.
For us:
- what are the on-campus needs? who are the on-campus people?
1.2 2. Safe space and code of conduct
This is intended to be a safe and friendly place for learning
Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html
In particular, please ask questions, because I guarantee you that your question will help others.
1.3 3. Instructor introductions
Titus Brown - prof here at UC Davis in the School of Vet Med.
Tamer Mansour - MD/PhD, postdoc with Titus.
Camille Scott - CS graduate student here at Davis.
1.4 4. Amazon and cloud computing - why?
- simplifies software installation;
- can be used for bigger analyses quite easily;
- good for "burst" capacity (just got a data set!);
- accessible everywhere;
- they give us $100 gift certificates.
1.5 5. Sticky notes and how they work + Minute Cards
Basic rules:
- no sticky note - "working on it";
- green sticky note - "all is well";
- red sticky note - "need help!".
Place the sticky notes where we can see them from the back of the room -- e.g. on the back of your laptop.
At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.
1.6 6. Warning
The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.
---
Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch -- so-called "de novo transcriptome assembly". There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.
The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" -- in the sense of actually being present -- but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process
[Figure: _static/nonmodel-rnaseq-pipeline.png -- overview of the de novo mRNAseq pipeline]
- Copy over your RNAseq data (from two or more samples);
- Trim primers and junk from sequence (Short read quality and trimming);
- Do abundance normalization (Running digital normalization);
- Assemble everything together (Running the actual assembly).
This gives you an assembled transcriptome, consisting of many transcripts and transcript families.
At this point you can do one or more of the following:
- Annotate your transcripts (Annotation of de novo transcriptome);
- Quantify your transcripts and examine differential expression (Quantification and Differential Expression);
- BLAST your transcripts individually (BLASTing your assembled data).
Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:
- go to http://aws.amazon.com/, log in, then "EC2" (upper left);
- select "Launch instance";
- select "Ubuntu 14.04" from the list;
- select "m3.xlarge" from the list (towards bottom of "General purpose");
- click "Review and launch";
- select "Launch";
- if your first time through, create a key pair; otherwise select existing;
- click "launch instance".
3.1 More details
3.1.1 Start up an EC2 instance
Log in
Go to 'https://aws.amazon.com/' in a Web browser.
Select the 'My Account/Console' menu option 'AWS Management Console'.
Log in with your username & password.
Click on EC2 (upper left)
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key)
Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.
Load it
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note, you have to replace the stuff after the '@' sign with the name of your host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:
A final checklist:
- EC2 instance is running;
- used Ubuntu 14.04;
- NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: reminder -- if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
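Here is what that permission change does, demonstrated on a throwaway file (the filename is made up):

```shell
# Create a scratch file with the usual owner-writeable permissions (rw-r--r--).
touch scratch.txt
chmod 644 scratch.txt

# Remove the owner's write bit, as we just did for the data files.
chmod u-w scratch.txt

# The octal mode is now 444, i.e. r--r--r-- : nobody can modify the file.
stat -c %a scratch.txt    # 444
```

Making the raw data read-only is cheap insurance against accidentally overwriting your only copy.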
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .
These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
- why do the files have DNA in the name?
- why are there R1 and R2 in the file names?
- why don't we combine all the files?
Links:
- FASTQ Format
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
- 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
- What should you pay attention to in the FastQC report?
- Which is "better", R1 or R2? And why?
Links:
- FastQC
- FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
- why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
- Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
- 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:
- is the quality trimmed data "better" than before?
- Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --
- '#' starts a comment;
- anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
- we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
- what's with 'orphans.fq'?
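The filename juggling is just two standard bash features: command substitution around 'basename' (to strip a suffix) and the ${variable/pattern/replacement} expansion (to swap _R1_ for _R2_). A standalone sketch, using one of the filenames from above:

```shell
# One of the R1 filenames used in the loop.
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# Strip the shared suffix to get the 'base' name.
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# Swap _R1_ for _R2_ (a bash-only expansion) to get the matching R2 name.
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```

Running these two lines by hand on one filename is also a good way to debug the loop before letting it run over everything.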
Questions:
- how do you figure out if it's working?
  - copy/paste it from Word;
  - put in lots of echo statements;
  - edit one line at a time;
- how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done
gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 \
     git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
- The different (i's) that correspond to the same (g) represent isoforms;
- The different (g's) could represent different genes (or parts of genes);
- The component (TR|c...) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:
(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences, in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have ran them overnight already You can download these files to save time
wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6
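Tabular BLAST output (-outfmt 6) puts the query ID in the first column and the E-value and bit score last. To see how many transcripts got at least one hit, count distinct query IDs. A sketch on a fabricated two-transcript example (toy.blastx.outfmt6 and its rows are invented for illustration; run the same pipe on your real blastx.outfmt6):

```shell
# Fabricate a tiny outfmt 6 file:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'comp1_c0\tsp|P1\t90.1\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n'  > toy.blastx.outfmt6
printf 'comp1_c0\tsp|P2\t85.0\t90\t13\t0\t1\t90\t1\t90\t1e-40\t180\n'    >> toy.blastx.outfmt6
printf 'comp2_c0\tsp|P1\t88.0\t95\t11\t0\t1\t95\t1\t95\t1e-45\t190\n'    >> toy.blastx.outfmt6

# Column 1 is the query; count distinct transcripts with at least one hit
cut -f1 toy.blastx.outfmt6 | sort -u | wc -l
```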
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
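hmmscan's --domtblout format mixes data rows with comment lines that start with '#', so filter those out before inspecting or counting domain hits. A sketch on a fabricated file (toy_pfam.out and its rows are invented; use TrinotatePFAM.out for real data):

```shell
# Fabricate a tiny domtblout-style file: '#' lines are comments, others are domain hits
printf '#-- target name   accession  query name --\n' > toy_pfam.out
printf 'PF00001.1  -  orf1  -  1e-10  55.0\n'        >> toy_pfam.out
printf 'PF00002.2  -  orf2  -  1e-08  40.2\n'        >> toy_pfam.out

# Drop comment lines, then count domain hits
grep -v '^#' toy_pfam.out | wc -l
```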
2. We can predict other features, like:

• signal peptides, using SignalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate-2.0.2 Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
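The report is tab-separated with a header row, so you can list its column names by transposing the first line. A sketch on a toy report (toy_report.xls and its column names are invented for illustration; substitute trinotate_annotation_report.xls to see the real columns):

```shell
# Toy tab-separated report with a header row (column names are made up)
printf 'gene_id\ttranscript_id\tsprot_Top_BLASTX_hit\tPfam\n' > toy_report.xls
printf 'g1\tt1\tP1\tPF00001\n'                               >> toy_report.xls

# Print one column name per line
head -1 toy_report.xls | tr '\t' '\n'
```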
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:

bundle_id  target_id    length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4   269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1   811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1   790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1   852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1   675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13 2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13 406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13 447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8  2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
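Since results.xprs is tab-separated with target_id in column 2 and est_counts in column 7, you can pull out the most abundant transcript with sort. A sketch on a cut-down toy file (toy.xprs and its rows are fabricated; the same pipe works on a real results.xprs):

```shell
# Fabricate a cut-down results.xprs: col 2 = target_id, col 7 = est_counts
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > toy.xprs
printf '1\tnemaidA\t811\t508.1\t1301\t45\t158.3\n'   >> toy.xprs
printf '2\tnemaidB\t790\t487.1\t1845\t356\t1218.9\n' >> toy.xprs

# Skip the header, sort by est_counts descending, report the top target_id
tail -n +2 toy.xprs | sort -k7,7gr | head -1 | awk '{print $2}'
```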
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
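With the FDR-adjusted P-value in the last column, a one-liner can pull out the genes called significant at a chosen cutoff. A sketch on a fabricated CSV (toy-edgeR.csv and its rows are invented, and the FDR being column 5 is an assumption - check the header of your own CSV first):

```shell
# Fabricate a small edgeR-style CSV; here FDR is assumed to be column 5
printf 'gene,logFC,logCPM,PValue,FDR\n'  > toy-edgeR.csv
printf 'geneA,2.5,6.1,0.0001,0.001\n'   >> toy-edgeR.csv
printf 'geneB,-1.2,5.0,0.01,0.04\n'     >> toy-edgeR.csv
printf 'geneC,0.3,4.2,0.5,0.6\n'        >> toy-edgeR.csv

# Print genes below a 5% FDR cutoff, skipping the header row
awk -F, 'NR > 1 && $5 < 0.05 { print $1 }' toy-edgeR.csv
```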
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options -

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
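You can double-check how many records a subset contains by counting lines and dividing by four, since FASTQ stores four lines per record. A sketch that builds its own toy file (toy.fq.gz is a made-up stand-in for sample.fq.gz):

```shell
# Make a toy gzipped FASTQ with 2 records (8 lines)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > toy.fq.gz

# FASTQ records are 4 lines each, so records = lines / 4
echo $(( $(gunzip -c toy.fq.gz | wc -l) / 4 ))
```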
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really, any cloud machine) -

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
This workshop was given on May 4th and 5th, 2015, by C. Titus Brown, Camille Scott, and Tamer Mansour. See the workshop organization page for more information, or contact Titus directly.
Tutorials
CHAPTER 1
Welcome
1.1 1. Learning goals

For you:

• get a first (or second) look at tools;
• gain some experience in the basic command line;
• get 80% of the way to a complete analysis of some data.

For us:

• what are the on-campus needs? who are the on-campus people?
1.2 2. Safe space and code of conduct

This is intended to be a safe and friendly place for learning.

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others.
1.3 3. Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med.

Tamer Mansour - MD/PhD, postdoc with Titus.

Camille Scott - CS graduate student here at Davis.
1.4 4. Amazon and cloud computing - why?

• simplifies software installation;
• can be used for bigger analyses quite easily;
• good for "burst" capacity (just got a data set!);
• accessible everywhere;
• they give us $100 gift certificates.
1.5 5. Sticky notes and how they work + Minute Cards

Basic rules:

• no sticky note - "working on it";
• green sticky note - "all is well";
• red sticky note - "need help!".

Place the sticky notes where we can see them from the back of the room - e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.

1.6 6. Warning

The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.
---

Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch - so-called "de novo transcriptome assembly". There are a few programs to do this - most notably Trinity and Oases - and we have found little difference.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" - in the sense of actually being present - but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.

2.1 The overall process

[figure: _static/nonmodel-rnaseq-pipeline.png]

• Copy over your RNAseq data (from two or more samples).
• Trim primers and junk from sequence (Short read quality and trimming).
• Do abundance normalization (Running digital normalization).
• Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point, you can do one or more of the following:

• Annotate your transcripts (Annotation of denovo transcriptome).
• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
• BLAST your transcripts individually (BLASTing your assembled data).

Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:

• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise, select existing;
• click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select 'My Account/Console' menu option; 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch.
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in:
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running;
• used ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:
mkdir /mnt/data
Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now if you type:
ls -l
you should see something like:
-rw-r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-rw-r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable; by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
We'll talk about what these files are below.
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in:
ln -fs /mnt/data/* .
These are FASTQ files; let's take a look at them:
less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(Use the spacebar to scroll down, and type 'q' to exit 'less'.)
Questions:
• Why do the files have DNA in the name?
• Why are there R1 and R2 in the file names?
• Why don't we combine all the files?
Links:
• FASTQ Format
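If you haven't met FASTQ before, each record is exactly four lines. Here's a tiny made-up example you can poke at; the reads are invented, not from the Nematostella data:

```shell
# Build a two-record FASTQ file: each record is @header, sequence, '+', quality.
cat > /tmp/mini.fastq <<'EOF'
@read1/1
ACGTACGT
+
IIIIIIII
@read2/1
TTGGCCAA
+
IIIIIIII
EOF

# Count records: total line count divided by four.
echo $(( $(wc -l < /tmp/mini.fastq) / 4 ))    # prints 2
```

For the real gzipped files, the same trick is `zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | wc -l`, then divide by four.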
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer; that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':
ls -d *fastqc*
to list the files, and you should see:
0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because we've made copies of them for you:
• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?
Links:
• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:
Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
• Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(Again, use spacebar to scroll, 'q' to exit less.)
Questions:
• Is the quality-trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files, and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line: the ability to automate such tasks.
Here's a for loop that you can run; we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention:
• '#' starts a comment;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
Questions:
• How do you figure out if it's working?
  – copy/paste it from Word;
  – put in lots of echo statements;
  – edit one line at a time.
• How on earth do you figure out how to do this?
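One cheap way to answer "is it working?" is a dry run: point the same loop at empty dummy files and replace the TrimmomaticPE call with echo, which exercises all of the filename logic without running anything. The file names below are just the ones from this tutorial, touched into existence as empty stand-ins:

```shell
# Make empty dummy files so the glob has something to match.
mkdir -p /tmp/loopdemo && cd /tmp/loopdemo
touch 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

for filename in *_R1_*.extract.fastq.gz
do
    base=$(basename $filename .extract.fastq.gz)   # strips the shared suffix
    baseR2=${base/_R1_/_R2_}                       # bash substitution: _R1_ -> _R2_
    echo "would run: TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz ..."
done
```

Once every echoed command line looks right, swap the echo back out for the real TrimmomaticPE call.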
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done
gzip orphans.fq
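To see what "interleaved" means without khmer, here is a rough equivalent on two tiny uncompressed files; the reads are invented, and on real data you should use interleave-reads.py, which also validates that the pairs actually match:

```shell
# One read in each file; /1 and /2 mark the two ends of the same fragment.
printf '@r1/1\nAAAA\n+\nIIII\n' > /tmp/r1.fq
printf '@r1/2\nTTTT\n+\nIIII\n' > /tmp/r2.fq

# Fold each 4-line record onto one line, alternate the two files, unfold again.
paste -d '\n' \
    <(paste -d '\t' - - - - < /tmp/r1.fq) \
    <(paste -d '\t' - - - - < /tmp/r2.fq) | tr '\t' '\n' > /tmp/interleaved.fq

cat /tmp/interleaved.fq   # R1 record first, then its R2 mate
```

The output alternates whole four-line records: the R1 read, then its R2 partner, and so on down the file.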
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data - known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step:
0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/
cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
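Before launching a long assembly, it's worth checking that left.fq and right.fq contain the same number of reads; a mismatch means the pairing broke somewhere upstream. A sketch of the check, on invented one-read stand-in files (on real data, run it on the left.fq and right.fq you just made):

```shell
# Stand-in files with one read each.
printf '@a/1\nACGT\n+\nIIII\n' > /tmp/left.fq
printf '@a/2\nTGCA\n+\nIIII\n' > /tmp/right.fq

# A FASTQ read is four lines, so reads = lines / 4.
left=$(( $(wc -l < /tmp/left.fq) / 4 ))
right=$(( $(wc -l < /tmp/right.fq) / 4 ))
if [ "$left" -eq "$right" ]
then
    echo "OK: $left read pairs"
else
    echo "MISMATCH: $left vs $right reads"
fi
```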
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:
cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The components (TR|c...) often contain related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
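Since the map is just two tab-separated columns - gene, then transcript - standard UNIX tools can summarize it. For example, counting isoforms per gene on a fabricated three-line map (the IDs are invented, in Trinity's naming style):

```shell
# gene <TAB> transcript, one row per isoform: gene TR1|c0_g1 has two isoforms.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > /tmp/map.txt

# Count how many isoforms each gene has.
cut -f 1 /tmp/map.txt | sort | uniq -c
```

Running the same one-liner on Trinity.fasta.gene_trans_map gives a quick picture of how isoform-rich your assembly is.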
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:
cd
sudo cpan URI::Escape
Note: type yes for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run TransDecoder to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases.
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.
(a) search Trinity transcripts:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search Transdecoder-predicted proteins:
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database, renaming the output files accordingly:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
We ran these overnight already; you can download the result files to save time:
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features
1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
• signal peptides, using signalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database
Install Trinotate:
cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:
sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts);
2. Protein sequences (currently as defined by TransDecoder);
3. Gene/Transcript relationships.
~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: load functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report
~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting the best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let's look at the output. Open a new shell on your local machine and copy the report over:
scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download eXpress
Now, grab eXpress:
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
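Since results.xprs is plain tab-separated text with a header row, you can already answer simple questions with sort and cut. A sketch on a fabricated three-column stand-in file (in the real file, tpm is the 15th column, so you would sort on -k 15,15 instead of -k 3,3):

```shell
# Miniature stand-in for results.xprs: a header row plus two transcripts.
printf 'bundle_id\ttarget_id\ttpm\n1\tnemaid1tr1\t10.5\n2\tnemaid2tr1\t99.9\n' > /tmp/mini.xprs

# Drop the header, sort by tpm (numeric, descending), print transcript IDs.
tail -n +2 /tmp/mini.xprs | sort -k 3,3 -n -r | cut -f 2
```

This prints the most highly expressed transcript first, which is a handy sanity check before moving on to the formal differential expression analysis.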
9.4 Differential Expression
First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one - diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
• False Discovery Rate
• Learn R with Swirl
So, download the script:
cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now we run the differential expression script with:
Rscript diff_exp.R
This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns, including gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:
97.18% overall alignment rate
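That percentage is just the fraction of reads whose SAM flag does not have the 0x4 ("unmapped") bit set, so you can recompute it yourself from any SAM file. A sketch on a fabricated two-read SAM fragment, one mapped and one unmapped:

```shell
# Minimal SAM: one header line, one mapped read (flag 0), one unmapped read (flag 4).
printf '@HD\tVN:1.0\nr1\t0\tctg1\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' > /tmp/mini.sam

# Skip header lines; a read is mapped when bit 4 of column 2 (the flag) is clear.
awk '!/^@/ { total++; if (int($2 / 4) % 2 == 0) mapped++ }
     END { printf "%.2f%% overall alignment rate\n", 100 * mapped / total }' /tmp/mini.sam
```

On this toy input it reports a 50.00% rate; run the same awk on a real bowtie2 SAM output to cross-check the number bowtie2 prints.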
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6
URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time; e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
- UT (Austin) Sequencing Core prices - costs and yields for sequencing
- ANGUS - summer NGS course - lots of resources and materials, and book reference
- Data Carpentry - intro to R, etc.
- Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
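If you just want a quick look at average quality per position without opening the FastQC report, you can compute it yourself at the command line. This is a side sketch, not part of the official pipeline; it assumes Sanger/phred+33 quality encoding, and uses a tiny made-up FASTQ file so you can try it anywhere:

```shell
# a tiny made-up FASTQ: two 4-base reads ('I' = Q40, '!' = Q0 in phred+33)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII!!\n' > toy.fastq

# quality strings are every 4th line; average the scores per position
awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
     NR % 4 == 0 {
         for (i = 1; i <= length($0); i++) {
             sum[i] += ord[substr($0, i, 1)] - 33
             n[i]++
         }
     }
     END { for (i = 1; i in sum; i++) printf "%d %.1f\n", i, sum[i]/n[i] }
    ' toy.fastq > mean-quality.txt

cat mean-quality.txt    # positions 3 and 4 average Q20 because of the '!' bases
```

On real data, swap in `gunzip -c yourfile.fastq.gz` as the input; a big dip in the numbers flags the position where you might trim.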
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options --
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs, record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
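One caution: for paired-end files, subset R1 and R2 the same way, so the two files stay in sync read-for-read. A sketch of that idea (the loop below first makes a tiny synthetic pair so you can try it without the real data; substitute your own .fastq.gz files and a line count that is a multiple of 4):

```shell
# make a small synthetic pair of FASTQ files to practice on
for r in 1 2
do
    printf '@read1/%s\nACGT\n+\nIIII\n@read2/%s\nTTTT\n+\nIIII\n' $r $r \
        | gzip > toy_R${r}.fastq.gz
done

# take the same number of lines from the top of each file --
# 'head -400000' keeps the first 100,000 records of each
for r in 1 2
do
    gunzip -c toy_R${r}.fastq.gz | head -400000 | gzip > sample_R${r}.fq.gz
done
```

Because both files are cut at the same line, record N in sample_R1 is still the mate of record N in sample_R2.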
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or, really, any cloud machine) --
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt
which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 1
Welcome
1.1 Learning goals
For you:
- get a first (or second) look at tools;
- gain some experience in the basic command line;
- get 80% of the way to a complete analysis of some data.
For us:
- what are the on-campus needs? who are the on-campus people?
1.2 Safe space and code of conduct
This is intended to be a safe and friendly place for learning
Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html
In particular, please ask questions, because I guarantee you that your question will help others!
1.3 Instructor introductions
Titus Brown - prof here at UC Davis in the School of Vet Med
Tamer Mansour - MD/PhD, postdoc with Titus
Camille Scott - CS graduate student here at Davis
1.4 Amazon and cloud computing - why?
- simplifies software installation
- can be used for bigger analyses quite easily
- good for "burst" capacity (just got a data set!)
- accessible everywhere
- they give us $100 gift certificates
1.5 Sticky notes and how they work + Minute Cards
Basic rules:
- no sticky note - "working on it"
- green sticky note - "all is well"
- red sticky note - "need help!"
Place the sticky notes where we can see them from the back of the room -- e.g. on the back of your laptop.
At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.
1.6 Warning
The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.
---
Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch -- so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.
The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" -- in the sense of actually being present -- but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process
[Figure: the non-model mRNAseq pipeline (_static/nonmodel-rnaseq-pipeline.png)]
- Copy over your RNAseq data (from two or more samples).
- Trim primers and junk from sequence (Short read quality and trimming).
- Do abundance normalization (Running digital normalization).
- Assemble everything together (Running the actual assembly).
This gives you an assembled transcriptome, consisting of many transcripts and transcript families.
At this point you can do one or more of the following:
- Annotate your transcripts (Annotation of denovo transcriptome).
- Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
- BLAST your transcripts individually (BLASTing your assembled data).
Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:
- go to http://aws.amazon.com/, log in, then "EC2" (upper left);
- select "Launch instance";
- select "Ubuntu 14.04" from the list;
- select "m3.xlarge" from the list (towards bottom of "General purpose");
- click "Review and launch";
- select "Launch";
- if your first time through, create a key pair; otherwise select existing;
- click "launch instance".
3.1 More details
3.1.1 Start up an EC2 instance
Log in
Go to 'https://aws.amazon.com' in a Web browser.
Select the 'My Account/Console' menu option 'AWS Management Console'.
Log in with your username & password.
Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance.
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings!) and click on Launch.
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key!)
Open puttygen, and select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications... Utilities...) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type:
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:
A final checklist:
- EC2 instance is running;
- used ubuntu 14.04;
- NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder - if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:
mkdir /mnt/data
Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:
ls -l
you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
We'll talk about what these files are below.
4.4 Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in --
ln -fs /mnt/data/* .
These are FASTQ files -- let's take a look at them:
less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
- why do the files have DNA in the name?
- why are there R1 and R2 in the file names?
- why don't we combine all the files?
Links:
- FASTQ Format
4.5 FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':
ls -d *fastqc*
to list the files, and you should see:
0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
- 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
- What should you pay attention to in the FastQC report?
- Which is "better", R1 or R2? And why?
Links:
- FastQC
- FastQC tutorial video
4.6 Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:
Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
- Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
- Trimmomatic
4.7 FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
- 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(again, use spacebar to scroll, 'q' to exit less)
Questions:
- Is the quality-trimmed data "better" than before?
- Does it matter that you still have adapters?
4.8 Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --
- lines starting with '#' are comments;
- anywhere you see a '$', the variable after it is replaced by its value; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
- we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
- what's with 'orphans.fq'?
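The two bits of shell syntax doing the filename surgery in that loop are worth knowing on their own: $(basename file suffix) strips a suffix, and ${variable/pattern/replacement} is bash's pattern substitution. A standalone illustration, using one of this tutorial's actual filenames (no data files needed, just bash):

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# strip the trailing .extract.fastq.gz suffix
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# swap _R1_ for _R2_ to get the mate file's name
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```

Note that ${.../.../...} substitution is a bash feature, not plain POSIX sh, so run these loops under bash.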
Questions:
- How do you figure out if it's working?
  - copy/paste it from Word;
  - put in lots of echo;
  - edit one line at a time.
- How on earth do you figure out how to do this?
4.9 Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step --
0Hour_ATCACG_L002001.dn.fq.gz  6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz  6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz  6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz  6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz  orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at httpltamazon machine namegtblastkit
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
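The gene_trans_map file is just two tab-separated columns (gene ID, transcript ID), so you can count isoforms per gene with awk. A sketch, using invented Trinity-style IDs so it runs without the real file:

```shell
# A made-up, miniature gene_trans_map: gene <TAB> transcript.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR1|c0_g2\tTR1|c0_g2_i1\n' > demo_gene_trans_map

# Count how many transcripts (isoforms) map to each gene.
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' demo_gene_trans_map | sort > demo_iso_counts.txt
cat demo_iso_counts.txt
```

Running the same awk line on Trinity.fasta.gene_trans_map gives you an isoforms-per-gene summary of your real assembly.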
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4    269   0.000000     0     0  0.000000     0.000000      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1    811   508.137307   1301  45  158.338092   252.711602    4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1    790   487.144836   1845  356 1218.927626  1976.727972   1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1    852   549.122606   1792  3   871.770849   1352.610064   5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1    675   372.190166   1005  20  88.963433    161.343106    2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825 9919.902997 11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406   103.720396   360   270 271.097003   1061.173959   1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447   144.526787   6     0   0.000000     0.000000      2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8   2075  1771.684102  2782  58  958.636395   1122.756883   1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
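Since results.xprs is plain tab-separated text, you can pull out individual columns with awk instead of csvtool. A sketch that extracts target_id (column 2) and est_counts (column 7); the two-line demo file here is an invented miniature of the real output:

```shell
# Build a tiny fake results.xprs: header plus one data row.
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > demo_results.xprs
printf '2\tnemaid1tr1\t811\t508.137307\t1301\t45\t158.338092\n' >> demo_results.xprs

# Skip the header (NR > 1) and print the target_id and est_counts columns.
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' demo_results.xprs > demo_est_counts.txt
cat demo_est_counts.txt
```

On a real results.xprs file, est_counts is further right (column 7 of the full 15-column layout shown above), so check the header before hard-coding a column number.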
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
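The edgeR CSV can also be post-processed from the shell; a common first step is sorting by the FDR-adjusted P-value so the most significant genes come first. A sketch, on an invented two-gene CSV whose columns follow the gene name / log fold change / P-value / FDR layout described above:

```shell
# A made-up miniature edgeR output.
cat > demo-edgeR.csv <<'EOF'
gene,logFC,PValue,FDR
geneA,2.1,0.001,0.02
geneB,-1.3,0.0001,0.004
EOF

# Keep the header, then sort the data rows numerically on column 4 (FDR).
(head -1 demo-edgeR.csv; tail -n +2 demo-edgeR.csv | sort -t, -k4,4g) > demo-sorted.csv
cat demo-sorted.csv
```

Here geneB (FDR 0.004) sorts above geneA (FDR 0.02); running the same pipeline on nema-edgeR.csv ranks your real genes by significance.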
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
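The "overall alignment rate" is simply aligned reads divided by total reads, as a percentage. A quick sketch of that arithmetic with invented counts chosen to reproduce the figure above:

```shell
# Invented example counts: 38,872 of 40,000 reads aligned.
aligned=38872
total=40000

# 100 * aligned / total, to two decimal places.
rate=$(awk -v a="$aligned" -v t="$total" 'BEGIN { printf "%.2f", 100 * a / t }')
echo "${rate}% overall alignment rate"
```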
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
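How such a URL is put together from its parts (using the example credentials from the text, the standard `ftp://user:password@host/` form):

```shell
# The pieces the sequencing core's e-mail gives you.
username=rnaseqmodel
password=QecheJa6
hostname=titan.bch.msu.edu

# Combine them into a single browsable URL.
url="ftp://${username}:${password}@${hostname}/"
echo "$url"
```

Note that a password embedded in a URL is visible in shell history and process listings, so treat such links with care.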
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
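A minimal skeleton for the "write a script" step: the two `set` lines are the important part (fail fast, and log every command as it runs). The step contents are placeholders, not real commands from this tutorial:

```shell
# Write out a tiny pipeline script; the echo lines stand in for real steps.
cat > demo_pipeline.sh <<'EOF'
#!/bin/bash
set -e   # stop at the first failing command instead of plowing on
set -x   # echo each command as it runs, leaving a record of what happened
echo "step 1: trim reads"
echo "step 2: assemble"
EOF

# Run it, capturing stdout (the set -x trace goes to stderr).
bash demo_pipeline.sh > demo_pipeline.log
cat demo_pipeline.log
```

The same skeleton, with real commands substituted in, is what you would then wrap in a qsub header for the cluster.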
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
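The 4-to-1 ratio of lines to records comes from the FASTQ format itself: each record is exactly four lines (header, sequence, '+', qualities). A tiny self-contained demonstration with an invented two-read file:

```shell
# Two FASTQ records = eight lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nTGCA\n+\nIIII\n' > demo_sample.fq

lines=$(wc -l < demo_sample.fq)
records=$(( lines / 4 ))
echo "$lines lines = $records records"
```

This is also why the `head -400000` count above should always be a multiple of 4, so you never cut a record in half.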
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
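The disk rule of thumb above, as arithmetic (the read count here is an invented example, not from this tutorial):

```shell
# Example: a 500-million-read project.
reads=500000000

# Roughly 1 TB of intermediate space per 200 million reads.
tb_needed=$(awk -v r="$reads" 'BEGIN { printf "%.1f", r / 200000000 }')
echo "~${tb_needed} TB of intermediate disk space"
```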
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
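With `df -k`, the first numeric column is in 1K (1024-byte) blocks, so dividing by 1024² converts it to GiB. Applied to the block count from the example output above:

```shell
# 1K-blocks value from the df example.
blocks=38565344

# blocks / 1024^2 = GiB, rounded to the nearest whole number.
gb=$(awk -v b="$blocks" 'BEGIN { printf "%.0f", b / 1048576 }')
echo "${gb} GB total on /mnt"
```

That works out to about 37 GiB, i.e. the "about 40 GB" quoted in the text.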
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 1
Welcome
1.1 1. Learning goals

For you:

• get a first (or second) look at tools;
• gain some experience in the basic command line;
• get 80% of the way to a complete analysis of some data.

For us:

• what are the on-campus needs? who are the on-campus people?
1.2 2. Safe space and code of conduct

This is intended to be a safe and friendly place for learning.

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others!
1.3 3. Instructor introductions

Titus Brown - prof here at UC Davis in the School of Vet Med.

Tamer Mansour - MD/PhD, postdoc with Titus.

Camille Scott - CS graduate student here at Davis.
1.4 4. Amazon and cloud computing - why?

• simplifies software installation;
• can be used for bigger analyses quite easily;
• good for "burst" capacity (just got a data set!);
• accessible everywhere;
• they give us $100 gift certificates.
1.5 5. Sticky notes and how they work + Minute Cards

Basic rules:

• no sticky note - "working on it"
• green sticky note - "all is well"
• red sticky note - "need help!"

Place the sticky notes where we can see them from the back of the room -- e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.

1.6 6. Warning

The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am...

---

Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch -- so-called "de novo transcriptome assembly". There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" -- in the sense of actually being present -- but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process

[Figure: _static/nonmodel-rnaseq-pipeline.png]
• Copy over your RNAseq data (from two or more samples).
• Trim primers and junk from sequence (Short read quality and trimming).
• Do abundance normalization (Running digital normalization).
• Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point, you can do one or more of the following:

• Annotate your transcripts (Annotation of denovo transcriptome).
• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
• BLAST your transcripts individually (BLASTing your assembled data).

Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:

• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise select existing;
• click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com' in a Web browser.

Select 'My Account/Console' menu option 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings!) and click on Launch.
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click lsquoLaunchrsquo
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download PuTTY and PuTTYgen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key.)
Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with PuTTY
Open up PuTTY, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications, Utilities) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type:
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com
Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure above.
At the end you should see text and a prompt that look like this:
A final checklist:
* EC2 instance is running
* used ubuntu 14.04
* NOT a micro instance (m3.xlarge or bigger)
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder, if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
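To see what that mode actually does, here's a quick check you can run on a throwaway directory (the /tmp path below is just for illustration). The 't' (sticky bit) is what lets many users share the directory without being able to delete each other's files:

```shell
# Make a scratch directory and give it the same mode as /mnt above.
mkdir -p /tmp/scratchdemo
chmod a+rwxt /tmp/scratchdemo

# The mode string shows rwx for user/group/other, with 't' (sticky bit)
# in the last position: drwxrwxrwt
ls -ld /tmp/scratchdemo
```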
Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
17
labibi Documentation Release 10
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-rw-rw-r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable; by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
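If you're curious what removing the write bit buys you, here's a tiny demonstration on a throwaway file (the filename is made up for the example):

```shell
# Create a file, then drop the owner's write permission.
echo "important" > /tmp/demo.txt
chmod u-w /tmp/demo.txt

# The owner's permission triplet is now r--, so a stray redirect or an
# editor save will be refused instead of silently clobbering the data.
ls -l /tmp/demo.txt
```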
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .
These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
* why do the files have DNA in the name?
* why are there R1 and R2 in the file names?
* why don't we combine all the files?
Links:
* FASTQ Format
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
* 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
* What should you pay attention to in the FastQC report?
* Which is "better", R1 or R2? And why?
Links:
* FastQC
* FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
* Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:
* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:
* is the quality trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files - and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line - the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --
* lines starting with '#' are comments
* anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
* what's with 'orphans.fq'?
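To see the filename manipulation on its own, you can paste just those two steps into a bash prompt with a single example filename (no Trimmomatic involved):

```shell
# One example filename from this data set:
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips the trailing suffix, leaving the sample/lane/pair name:
base=$(basename $filename .extract.fastq.gz)
echo $base        # 0Hour_ATCACG_L002_R1_001

# bash's ${var/pattern/replacement} swaps _R1_ for _R2_:
baseR2=${base/_R1_/_R2_}
echo $baseR2      # 0Hour_ATCACG_L002_R2_001
```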
Questions:
* how do you figure out if it's working?
  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time
* how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
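A cheap sanity check before interleaving is that each R1 file and its R2 partner must contain the same number of reads. Here's the idea on a pair of tiny made-up files; run the same counts on your .qc.fq.gz pairs:

```shell
# Two one-read FASTQ files standing in for an R1/R2 pair:
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > /tmp/R1.fq.gz
printf '@read1/2\nTTTT\n+\nIIII\n' | gzip > /tmp/R2.fq.gz

# FASTQ records are 4 lines each, so line counts must match exactly:
r1=$(zcat /tmp/R1.fq.gz | wc -l)
r2=$(zcat /tmp/R2.fq.gz | wc -l)
echo "R1: $r1 lines, R2: $r2 lines"
```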
Next Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
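If you want to see how much diginorm shrank things, count reads before and after; a FASTQ record is four lines, so divide line counts by 4. This is demonstrated on a tiny made-up file here; point the same pipeline at your .extract.fastq.gz and .dn.fq.gz files:

```shell
# Build a two-read gzipped FASTQ to demonstrate on:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > /tmp/tiny.fq.gz

# reads = lines / 4
echo $(( $(zcat /tmp/tiny.fq.gz | wc -l) / 4 ))    # 2
```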
Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:
* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
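The map itself is just two tab-separated columns, gene then transcript, so standard UNIX tools answer "how many genes? how many transcripts?" directly. A three-line toy map is used below; substitute Trinity.fasta.gene_trans_map:

```shell
# Toy gene_trans_map: gene g1 has two isoforms, gene g2 has one.
printf 'g1\tg1_i1\ng1\tg1_i2\ng2\tg2_i1\n' > /tmp/map.txt

cut -f1 /tmp/map.txt | sort -u | wc -l    # distinct genes: 2
cut -f2 /tmp/map.txt | sort -u | wc -l    # distinct transcripts: 3
```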
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:
(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
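Whichever database you used, -outfmt 6 output is 12 tab-separated columns: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, and bit score. A quick way to eyeball the hits is to pull out query, subject, and e-value (a one-line made-up hit below; run the same cut on blastx.outfmt6 etc.):

```shell
# One fabricated outfmt-6 line, just to show the column layout:
printf 'q1\ts1\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-50\t190\n' > /tmp/hits.outfmt6

# Columns 1, 2, and 11 are query, subject, and e-value:
cut -f1,2,11 /tmp/hits.outfmt6
```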
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:
* signal peptides, using signalP
* transmembrane regions, using tmHMM
* rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
...and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
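Since results.xprs is tab-separated with a header row, you can rank transcripts by expression with plain sort; column 8 is eff_counts. A two-row toy file is used below; point the same pipeline at the real results.xprs:

```shell
# Miniature results.xprs with a header and two transcripts:
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\teff_counts\n' > /tmp/results.xprs
printf '3\tnemaid62tr13\t2150\t1846.6\t9921\t9825\t9919.9\t11549.4\n' >> /tmp/results.xprs
printf '2\tnemaid2tr1\t790\t487.1\t1845\t356\t1218.9\t1976.7\n' >> /tmp/results.xprs

# Skip the header, sort numerically (descending) on eff_counts,
# and print the most abundant target_id:
tail -n +2 /tmp/results.xprs | sort -k8,8gr | cut -f2 | head -1    # nemaid62tr13
```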
9.4 Differential Expression
First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
* False Discovery Rate
* Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:
97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6
URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.
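The pattern generalizes: an FTP URI embeds the credentials as ftp://username:password@hostname. A quick sketch that assembles the example credentials above into the combined link (shell only; nothing is downloaded):

```shell
# Combine the example username, password, and hostname into one FTP URI.
username="rnaseqmodel"
password="QecheJa6"
host="titan.bch.msu.edu"

uri="ftp://${username}:${password}@${host}"
echo $uri    # ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
```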
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' - for example, on the HPC:
mkdir ~/testdata
cd ~/testdata
wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options -
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
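Because each FASTQ record is exactly four lines, you can sanity-check a subset by counting lines and dividing by four. A small sketch using a made-up two-record file (the filename tiny.fq.gz is just for illustration):

```shell
# Build a tiny two-record FASTQ file so the arithmetic is easy to see.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' | gzip > tiny.fq.gz

# Count the lines, then divide by 4 to get the number of records.
lines=$(gunzip -c tiny.fq.gz | wc -l)
records=$(( lines / 4 ))
echo $records    # 2
```

The same check on sample.fq.gz above should print 100000.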
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or, really, any cloud machine) -
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem 1K-blocks     Used Available Use% Mounted on
/dev/xvdb   38565344 20098736  16500940  55% /mnt
which tells you that /mnt has 40 GB of disk space.
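The '1K-blocks' column is in 1024-byte units, so you can convert the numbers in the listing above to gigabytes by dividing by 1024 twice - the ~36 GB of formatted capacity is what a nominal 40 GB device gives you. A shell arithmetic sketch:

```shell
# Convert the df figures above from 1K-blocks to (integer) GB.
total_kb=38565344
avail_kb=16500940

echo "total: $(( total_kb / 1024 / 1024 )) GB"    # total: 36 GB
echo "avail: $(( avail_kb / 1024 / 1024 )) GB"    # avail: 15 GB
```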
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
• they give us $100 gift certificates
1.5 5. Sticky notes and how they work + Minute Cards
Basic rules:
• no sticky note - "working on it"
• green sticky note - "all is well"
• red sticky note - "need help"
Place the sticky notes where we can see them from the back of the room - e.g. on the back of your laptop.
At the end of each session (coffee break, lunch, end of day), please write down on an index card one thing you learned and one thing you're still confused about.
1.6 6. Warning
The bit that isn't going to work is at the beginning. This is unavoidable. To reward you, we have a coffee break at 10:30am.
---
Next: Non-model organisms and RNAseq
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch - so-called "de novo transcriptome assembly". There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.
The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" - in the sense of actually present - but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process
[Figure: the non-model mRNAseq pipeline (_static/nonmodel-rnaseq-pipeline.png)]
• Copy over your RNAseq data (from two or more samples).
• Trim primers and junk from sequence (Short read quality and trimming).
• Do abundance normalization (Running digital normalization).
• Assemble everything together (Running the actual assembly).
This gives you an assembled transcriptome, consisting of many transcripts and transcript families.
At this point you can do one or more of the following:
• Annotate your transcripts (Annotation of denovo transcriptome).
• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
• BLAST your transcripts individually (BLASTing your assembled data).
Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version:
• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise select existing;
• click "launch instance".
3.1 More details
3.1.1 Start up an EC2 instance
Log in
Go to 'https://aws.amazon.com/' in a Web browser.
Select 'My Account/Console' menu option 'AWS Management Console'.
Log in with your username & password.
Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key.)
Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications... Utilities...) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
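If you're curious what that did, you can reproduce it on a throwaway file and watch the permission bits change. This sketch uses a scratch file, not your real key; 'stat -c' is the GNU/Linux form, so on the Mac itself you'd use 'ls -l' instead:

```shell
# Watch what 'chmod og-rwx' does, using a scratch file instead of a real key.
touch scratch.pem
chmod 644 scratch.pem                      # start from rw-r--r--
echo "before: $(stat -c %A scratch.pem)"   # before: -rw-r--r--

chmod og-rwx scratch.pem                   # strip all group/other access
echo "after:  $(stat -c %A scratch.pem)"   # after:  -rw-------
rm scratch.pem
```

ssh refuses to use a key that other users can read, which is why this step matters.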
Then type:
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com', using the authentication key located in 'amazon.pem' on your Desktop.
Note, you have to replace the stuff after the '@' sign with the name of your own host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:
A final checklist:
• EC2 instance is running;
• used ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:
mkdir /mnt/data
Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:
ls -l
you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
We'll talk about what these files are below.
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in -
ln -fs /mnt/data/* .
These are FASTQ files - let's take a look at them:
less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?
Links:
• FASTQ Format
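As a reminder of the format itself: each FASTQ record is exactly four lines - an '@' header with the read name, the sequence, a '+' separator, and one quality character per base. A minimal sketch with a fabricated record:

```shell
# Write one fabricated FASTQ record to show the four-line structure.
printf '@example/1\nACGTACGT\n+\nIIIIIIII\n' > one-record.fq

cat one-record.fq
# @example/1    <- read name ('/1' marks the R1 member of a pair)
# ACGTACGT      <- the sequence
# +             <- separator
# IIIIIIII      <- quality scores, one character per base
```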
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':
ls -d *fastqc*
to list the files, and you should see:
0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?
Links:
• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:
Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
• Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(again, use spacebar to scroll, 'q' to exit less)
Questions:
• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files - and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line - the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention -
• # are comments;
• anywhere you see a '$' is replaced by the value of the variable after it, so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
Questions:
• how do you figure out if it's working?
– copy/paste it from Word
– put in lots of echo
– edit one line at a time
• how on earth do you figure out how to do this?
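One answer: test the filename tricks on their own before running the whole loop. The two manipulations in the trimming loop can each be tried at the prompt with echo (the sample filename is one of ours, but any name matching the pattern works):

```shell
# Try the loop's two filename manipulations on a single sample name.
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips a trailing suffix from a filename...
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# ...and bash's ${variable/pattern/replacement} swaps _R1_ for _R2_.
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```

Once each piece prints what you expect, assembling them into the loop is much less scary.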
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data - known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) -
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step -
0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c...) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
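To make this concrete: the map is just two tab-separated columns (gene identifier, transcript identifier), so you can count isoforms per gene with awk. A sketch on a made-up map (the IDs below are invented, in the TR|c..._g..._i... style):

```shell
# A toy gene_trans_map: column 1 = gene, column 2 = isoform (invented IDs).
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > toy.gene_trans_map

# Count how many isoforms (i's) each gene (g) has.
awk -F'\t' '{count[$1]++} END {for (g in count) print g, count[g]}' toy.gene_trans_map | sort
```

Here TR1|c0_g1 has two isoforms and TR2|c0_g1 has one; the same awk run on your real Trinity.fasta.gene_trans_map profiles the assembly.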
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins.

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: UniRef90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using SignalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load the UniRef90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load the functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
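If csvtool is not available, the same extraction can be done with awk — a sketch on a miniature, made-up results.xprs. Note that the toy file below is reduced to four columns; in the real eXpress output, target_id and est_counts are not fields 2 and 4, so the field numbers must be adjusted to match the real header:

```shell
# A reduced, made-up results.xprs (the real file has many more tab-separated columns).
printf 'bundle_id\ttarget_id\tlength\test_counts\n' > mini.xprs
printf '1\tnema.id7.tr4\t269\t0.000000\n2\tnema.id1.tr1\t811\t158.338092\n' >> mini.xprs

# Print target_id and est_counts, skipping the header row
# (equivalent to csvtool namedcol + drop 1; adjust $2/$4 for the real column positions).
awk -F'\t' 'NR > 1 {print $2 "\t" $4}' mini.xprs > mini_counts.txt
cat mini_counts.txt
```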
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log counts per million, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
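That combined link is just the standard ftp://user:password@host URL form. A sketch, building it from the example credentials:

```shell
# Build an FTP URL that embeds the login credentials (example values from above).
username=rnaseqmodel
password=QecheJa6
hostname=titan.bch.msu.edu
echo "ftp://${username}:${password}@${hostname}/"
```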
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon. (See Tips and Tricks for working with Remote Computers.)

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses."

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
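The 4-lines-per-record arithmetic is easy to check on a synthetic file — a sketch, using an invented fake.fastq.gz:

```shell
# Build a fake FASTQ file with 200 records (4 lines each = 800 lines), gzipped.
for i in $(seq 1 200)
do
    printf '@read%d\nACGT\n+\nIIII\n' $i
done | gzip > fake.fastq.gz

# Take the first 400 lines, i.e. the first 100 records.
gunzip -c fake.fastq.gz | head -400 | gzip > sample-demo.fq.gz

# 400 lines / 4 lines per record = 100 records.
gunzip -c sample-demo.fq.gz | wc -l
```

Because head works on lines, always subset in multiples of 4 so you never cut a FASTQ record in half.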
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
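That rule of thumb is simple to turn into a quick estimate — a sketch, with an assumed read count:

```shell
# Rule of thumb: ~1 TB of intermediate disk per 200 million reads.
reads=500000000   # assumed: 500 million reads across all samples
awk -v r=$reads 'BEGIN { printf "%.1f TB\n", r / 200000000 }'
```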
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.
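To turn df's 1K-block counts into something readable, divide the Available column by 1024^2 — a sketch that parses a saved copy of output like the above:

```shell
# Save a copy of df-style output, then convert Available (1K blocks) to GB.
printf 'Filesystem 1K-blocks Used Available Use%% Mounted on\n' > df-demo.txt
printf '/dev/xvdb 38565344 20098736 16500940 55%% /mnt\n' >> df-demo.txt

# Column 4 is Available, in 1K blocks; 1024*1024 blocks = 1 GB.
awk 'NR==2 { printf "%.1f GB free on %s\n", $4 / 1024 / 1024, $6 }' df-demo.txt
```

On a live machine you would pipe `df -k /mnt` straight into the awk command instead of using a saved file.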
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 2
Non-model organisms and RNAseq
With non-model systems, where there is neither a good genome nor a lot of mRNAseq data, you have to build your own transcriptome from scratch – so-called "de novo transcriptome assembly." There are a few programs to do this - most notably Trinity and Oases - and we have found little difference between them.

The main problem you'll run into with non-model mRNAseq is that the output is fairly noisy with respect to splice variants. Our experience has been that many of these splice variants are probably "real" – in the sense of actually being present – but may be biological "noise", in the sense that they are not actually functional. (See this excellent paper by Pickrell and Pritchard making the case.) Regardless, there's little that you can do about this, although we will talk about it a bit on the second day.
2.1 The overall process

[figure: _static/nonmodel-rnaseq-pipeline.png]
• Copy over your RNAseq data (from two or more samples).
• Trim primers and junk from sequence (Short read quality and trimming).
• Do abundance normalization (Running digital normalization).
• Assemble everything together (Running the actual assembly).

This gives you an assembled transcriptome, consisting of many transcripts and transcript families.

At this point, you can do one or more of the following:

• Annotate your transcripts (Annotation of de novo transcriptome).
• Quantify your transcripts and examine differential expression (Quantification and Differential Expression).
• BLAST your transcripts individually (BLASTing your assembled data).
Next: Getting started with Amazon EC2
CHAPTER 3
Getting started with Amazon EC2
Short version
• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise select existing;
• click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance
Log in
Go to https://aws.amazon.com/ in a Web browser.

Select the 'My Account/Console' menu option 'AWS Management Console.'

Log in with your username & password.

Click on EC2 (upper left).
Select your zone
Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.
Then click on Launch Instance
Select the machine operating system to boot
Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click lsquoLaunchrsquo
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key.)
Open puttygen, and select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications... Utilities) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type:
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note: you have to replace the stuff after the '@' sign with the name of your host (see the red circle in the screenshot above).
At the end, you should see text and a prompt that look like this:
A final checklist
• EC2 instance is running
• used Ubuntu 14.04
• NOT a micro instance (m3.xlarge or bigger)
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:
mkdir /mnt/data
Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:
ls -l
you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
We'll talk about what these files are below.
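To see what `chmod u-w` does, here's a quick sketch you can run anywhere; the file name is just a throwaway example, not one of the data files:

```shell
cd "$(mktemp -d)"               # scratch directory, so we don't touch real data
echo ACGT > demo.fastq          # throwaway stand-in for a data file
chmod u-w demo.fastq            # remove the owner's write permission
ls -l demo.fastq | cut -c1-4    # file type + owner bits: now '-r--'
```

With the write bit gone, accidental edits or deletions of your raw data will fail (or prompt for confirmation) instead of silently succeeding.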
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in –
ln -fs /mnt/data/* .
These are FASTQ files – let's take a look at them:
less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')
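Each read in a FASTQ file is a four-line record: an '@' header with the read name, the sequence, a '+' separator, and one quality character per base. A tiny sketch with a single made-up record (the read name and sequence are invented):

```shell
cd "$(mktemp -d)"
# One made-up record: @name, sequence, '+', per-base quality characters.
printf '@read1/1\nACGTACGT\n+\nIIIIHHHH\n' > tiny.fastq
# Since each record is exactly four lines, line count / 4 = number of reads.
echo $(( $(wc -l < tiny.fastq) / 4 ))
```

The same line-count trick works on the real files (after `zcat`), which is a handy sanity check that a download completed.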
Questions:
• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?
Links:
• FASTQ Format
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':
ls -d *fastqc*
to list the files, and you should see:
0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?
Links:
• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25
You should see output that looks like this:
Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
• Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now, view my copies of these files:
• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(again, use spacebar to scroll, 'q' to exit less)
Questions:
• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 \
         SLIDINGWINDOW:4:2 \
         MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention –
• # denotes comments;
• anywhere you see a '$' is replaced by the value of the variable after it, so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about.
• what's with 'orphans.fq'?
Questions:
• how do you figure out if it's working?
  – copy/paste it from Word?
  – put in lots of echo?
  – edit one line at a time?
• how on earth do you figure out how to do this??
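The two filename tricks in the loop can be tried on a single name, no data files needed; bash's ${variable/pattern/replacement} substitution does the R1-to-R2 swap:

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz
base=$(basename $filename .extract.fastq.gz)   # strip the known suffix
echo $base                                     # 0Hour_ATCACG_L002_R1_001
baseR2=${base/_R1_/_R2_}                       # swap _R1_ for _R2_ (bash-only syntax)
echo $baseR2                                   # 0Hour_ATCACG_L002_R2_001
```

Echoing intermediate names like this, before adding the real command, is the easiest way to check a loop is constructing the filenames you intend.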
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/_}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
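Interleaving just alternates four-line records from the R1 and R2 files. Here's an awk sketch of the idea on two made-up one-read files; interleave-reads.py does the same job, but also validates that the records actually pair up:

```shell
cd "$(mktemp -d)"
printf '@r1/1\nAAAA\n+\nIIII\n' > r1.fq   # made-up R1 record
printf '@r1/2\nTTTT\n+\nIIII\n' > r2.fq   # its made-up R2 mate
# After every 4th line of r1.fq, emit the matching 4-line record from r2.fq.
awk -v f2=r2.fq '{buf = buf $0 ORS}
     NR % 4 == 0 {printf "%s", buf; buf="";
                  for (i = 0; i < 4; i++) {getline line < f2; print line}}' r1.fq
```

The output is eight lines: the R1 record followed by its mate, which is exactly the record ordering the next step expects.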
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.qc.fq.gz.keep.abundfilt
do
   extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step –
0Hour_ATCACG_L002_001.dn.fq.gz   6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz   6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz   6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz   6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done
cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh
cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit
mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu
cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa
cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:
cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes and isoforms:
• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
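The map is a plain two-column, tab-separated file: gene identifier, then transcript identifier. With a toy map (these IDs are invented, just mimicking the TR|c style mentioned above), you can count isoforms per gene:

```shell
cd "$(mktemp -d)"
# Toy two-column (gene, transcript) map; IDs are invented.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR1|c0_g2\tTR1|c0_g2_i1\n' > toy.gene_trans_map
# Column 1 is the gene; counting repeats gives the number of isoforms per gene.
cut -f1 toy.gene_trans_map | sort | uniq -c
```

Run the same pipeline on the real Trinity.fasta.gene_trans_map to see how many multi-isoform genes your assembly contains.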
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:
cd
sudo cpan URI::Escape
Note: type yes for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:
(a) search Trinity transcripts:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search Transdecoder-predicted proteins:
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database, renaming the output files accordingly:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already! You can download these files to save time:
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:
cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:
sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships
~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report
~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting the best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell and type:
scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get express:
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
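results.xprs is tab-separated with named header columns; the differential expression step only needs target_id and est_counts. The csvtool commands in the next section pull these out by name; an equivalent awk sketch on a toy file with the same layout (values invented):

```shell
cd "$(mktemp -d)"
# Toy results.xprs-style file: tab-separated, columns identified by header name.
printf 'bundle_id\ttarget_id\test_counts\n1\tnemaid7tr4\t0.000000\n2\tnemaid1tr1\t158.338092\n' > toy.xprs
# Locate the two columns by header name, then print them for every data row.
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) {if ($i=="target_id") t=i; if ($i=="est_counts") e=i}; next}
            {print $t "\t" $e}' toy.xprs
```

Finding columns by header name, rather than by position, means the extraction keeps working even if a tool version reorders its output columns.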
9.4 Differential Expression
First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt
Wersquoll be using edgeR to do the basic differential expression analysis of our counts
To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it
Links
bull False Discovery Rate
bull Learn R with Swirl
So download the script
cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR
Now, run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns, including gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
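Once you have a CSV like nema-edgeR.csv, a quick way to pull out candidate genes is to filter on the FDR column from the shell. This is an illustrative sketch on a toy CSV with made-up gene names and values, and an assumed column order, not the real workshop output:

```shell
# Toy CSV with an edgeR-like layout (made-up values; column order assumed:
# gene, log fold change, P-value, FDR-adjusted P-value).
cat > toy-edgeR.csv <<EOF
gene,logFC,PValue,FDR
nema1,2.31,0.0001,0.001
nema2,0.12,0.52,0.61
nema3,-1.87,0.002,0.012
EOF

# Print genes whose FDR (4th column) is below 0.05, skipping the header:
awk -F, 'NR > 1 && $4 < 0.05 { print $1 }' toy-edgeR.csv
```

Adjust the column index to match your actual file before using this on real output.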
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
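bowtie2 prints its summary to stderr, so one way to capture just that rate is to save the summary and grep it out. Here we use a canned, made-up summary instead of a real bowtie2 run, so the numbers are illustrative only:

```shell
# Fake bowtie2 summary (made-up numbers) standing in for real stderr output:
cat > bowtie2.log <<EOF
100000 reads; of these:
  100000 (100.00%) were unpaired; of these:
    2820 (2.82%) aligned 0 times
97.18% overall alignment rate
EOF

# Extract just the percentage from the summary line:
awk '/overall alignment rate/ { print $1 }' bowtie2.log
```

On a real run you would redirect stderr, e.g. `2> bowtie2.log`, before parsing it.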
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me.

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
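The tips above can be sketched as a minimal pipeline script. Everything here is a placeholder (hypothetical directory and file names, and a stand-in "processing" step), not the actual workshop commands:

```shell
#!/bin/bash
# Minimal pipeline-script skeleton (hypothetical steps and filenames).
# 'set -e' makes the script stop at the first failing command, which is
# exactly what you want while debugging on a small data subset.
set -e

WORKDIR=pipeline-test
mkdir -p $WORKDIR

# Step 1: make a tiny data subset to test on (here, one fake FASTQ record).
printf '@read1\nACGT\n+\nIIII\n' > $WORKDIR/subset.fq

# Step 2: "process" it -- a stand-in for the real trimming/assembly/mapping
# commands you would paste in from the sections above.
gzip -f $WORKDIR/subset.fq

# Step 3: check that the expected output actually appeared.
ls -l $WORKDIR/subset.fq.gz
```

Once a skeleton like this works end-to-end on the subset, swap the stand-in step for your real commands one at a time.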
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. All the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions; the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
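The 4:1 ratio comes from the FASTQ format itself: each record is exactly four lines. A tiny self-contained demo (made-up reads) of the same head-based subsetting:

```shell
# A FASTQ record is 4 lines (header, sequence, '+', qualities), so taking
# 400,000 lines keeps 400,000 / 4 = 100,000 records. Toy file with 3 records:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > tiny.fq

# Keep the first 2 records = the first 8 lines:
head -8 tiny.fq > tiny.sub.fq

# Count records by dividing the line count by 4:
echo $(( $(wc -l < tiny.sub.fq) / 4 ))
```

Always subset in multiples of 4 lines, or you will truncate a record mid-way.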
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
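To apply the rule of thumb above, just divide your read count by 200 million. A quick sketch of the arithmetic for a hypothetical 500-million-read project:

```shell
# Rule-of-thumb arithmetic: ~1 TB of intermediate disk per 200 million reads,
# so a hypothetical 500-million-read project needs about 2.5 TB.
awk 'BEGIN { reads_in_millions = 500; printf "%.1f TB\n", reads_in_millions / 200 }'
```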
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 3
Getting started with Amazon EC2
Short version:

- go to http://aws.amazon.com/, log in, then "EC2" (upper left);
- select "Launch instance";
- select "Ubuntu 14.04" from the list;
- select "m3.xlarge" from the list (towards bottom of "General purpose");
- click "Review and launch";
- select "Launch";
- if your first time through, create a key pair; otherwise select existing;
- click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select 'My Account/Console' menu option 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).
Select your zone

Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.

Select the machine operating system to boot

Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size

Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.

Confirm and launch

Review the details (ignore the warnings) and click on Launch.
(First time through) generate a new key pair

If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.

(Next times through) select an existing key pair

Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, Download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications/Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:

A final checklist:

- EC2 instance is running;
- used ubuntu 14.04;
- NOT a micro instance (m3.xlarge or bigger).

3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable; by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in:

ln -fs /mnt/data/* .

These are FASTQ files; let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

- why do the files have DNA in the name?
- why are there R1 and R2 in the file names?
- why don't we combine all the files?

Links:

- FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer; that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now; you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

- 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

- What should you pay attention to in the FastQC report?
- Which is "better", R1 or R2? And why?

Links:

- FastQC
- FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
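It's worth sanity-checking summary numbers like these yourself; the surviving-pair percentage is just surviving divided by input:

```shell
# Check the Trimmomatic arithmetic: 138775 surviving pairs out of
# 140557 input pairs is ~98.73%.
awk 'BEGIN { printf "%.2f%%\n", 138775 / 140557 * 100 }'
```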
Questions:

- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
- why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

- Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

- 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

- is the quality trimmed data "better" than before?
- Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files, and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line: the ability to automate such tasks.

Here's a for loop that you can run; we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention:

- # indicates a comment;
- anywhere you see a '$', it is replaced by the value of the variable after it; so, e.g., $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
- we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
- what's with 'orphans.fq'?
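The two filename tricks in the loop above can be tried on their own, outside the loop, with a single hard-coded filename (this uses bash's ${variable/pattern/replacement} substitution):

```shell
# Standalone demo of the loop's filename manipulation (bash syntax):
filename=0Hour_ATCACG_L002_R1_002.extract.fastq.gz

# basename strips a directory path and, optionally, a trailing suffix:
base=$(basename $filename .extract.fastq.gz)
echo $base      # prints: 0Hour_ATCACG_L002_R1_002

# ${variable/pattern/replacement} substitutes the first match:
baseR2=${base/_R1_/_R2_}
echo $baseR2    # prints: 0Hour_ATCACG_L002_R2_002
```

Running echo on the intermediate variables like this is exactly the "put in lots of echo" debugging trick mentioned below.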
Questions:

- how do you figure out if it's working?
  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time
- how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again; you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data, known as "digital normalization"; this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step:

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look on the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
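To make the gene/isoform distinction concrete, here is a small awk sketch that counts isoforms per gene from a gene_trans_map-style file. The two-gene map below is hypothetical (made-up IDs in the gene<TAB>transcript layout the mapping script emits), not real output:

```shell
# A hypothetical two-gene map in gene<TAB>transcript format.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR1|c0_g2\tTR1|c0_g2_i1\n' > demo.gene_trans_map

# Count isoforms (i's) per gene (g); genes with more than one isoform are the interesting ones.
awk -F'\t' '{ count[$1]++ } END { for (g in count) print g, count[g] }' \
    demo.gene_trans_map | sort > demo.isoform_counts
cat demo.isoform_counts
```

Here gene g1 has two isoforms and g2 has one; on a real map the same one-liner gives a quick feel for how fragmented or isoform-rich the assembly is.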
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type "yes" for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder-2.0.1 TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins.

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
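The -outfmt 6 output is a 12-column tab-separated table (query, subject, percent identity, ..., E-value, bit score), and BLAST writes each query's hits best-first. So a one-line awk filter recovers the single best hit per query; the three rows below are made-up values in that layout, not real search results:

```shell
# Two queries, three hits (made-up values) in BLAST outfmt6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\tsp1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\nq1\tsp2\t80.0\t90\t18\t0\t1\t90\t1\t90\t1e-20\t90\nq2\tsp3\t99.0\t50\t0\t0\t1\t50\t1\t50\t1e-30\t110\n' > demo.outfmt6

# Keep only the first (i.e. best) line for each query ID in column 1.
awk '!seen[$1]++' demo.outfmt6 > demo.best
cat demo.best
```

This trick is handy even with -max_target_seqs 1, since a single target can still contribute multiple HSP lines.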
2. Optional: Uniref90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have already run them overnight. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate-2.0.2 Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI and DBD::SQLite Perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let's look at the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
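The .sam files these commands produce follow the standard SAM layout: one alignment per line, with the read's mapped/unmapped status encoded in bit 0x4 of the FLAG field (column 2). A quick way to sanity-check how many reads mapped, sketched here on a tiny made-up SAM fragment rather than the real output:

```shell
# A minimal SAM fragment: one header line plus three made-up alignments.
# FLAG bit 0x4 set (e.g. flag 4) means the read is unmapped.
printf '@SQ\tSN:contig1\tLN:1000\nr1\t0\tcontig1\t1\t42\t50M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t16\tcontig1\t100\t42\t50M\t*\t0\t0\tACGT\tIIII\n' > demo.sam

# Skip @ header lines; a read is mapped when bit 0x4 of the FLAG is clear.
awk '!/^@/ { if (int($2 / 4) % 2 == 0) mapped++ ; total++ }
     END { print mapped " of " total " reads mapped" }' demo.sam > demo.mapcount
cat demo.mapcount
```

Note that because bowtie is run with -a (report all alignments), a single read can appear on several lines of the real file, so on real data this counts alignment records, not unique reads.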
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
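If csvtool isn't available, the same extraction can be sketched with awk, since results.xprs is tab-separated with target_id in column 2 and est_counts in column 7 (per the header shown above). The two-row demo.xprs file here is a made-up stand-in, not real eXpress output:

```shell
# A two-row stand-in for results.xprs, with the first seven columns of the real header.
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n1\tnemaid7tr4\t269\t0.000000\t0\t0\t0.000000\n2\tnemaid1tr1\t811\t508.137307\t1301\t45\t158.338092\n' > demo.xprs

# Same effect as the csvtool pipeline: pull target_id and est_counts, dropping the header.
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' demo.xprs > demo_counts.txt
cat demo_counts.txt
```

On the real files you would point the awk command at each results.xprs in turn, writing one counts file per replicate exactly as above.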
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one, diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
bull False Discovery Rate
bull Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log counts per million, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
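A common follow-up is pulling out just the significant genes from the CSV. The sketch below assumes (hypothetically) that the FDR-adjusted P-value is the last of the five columns; check the real nema-edgeR.csv header and adjust the column index if it differs:

```shell
# Hypothetical rows in a five-column edgeR-style CSV; column 5 = FDR-adjusted P-value.
printf 'gene,logFC,logCPM,PValue,FDR\ngeneA,2.5,6.1,1e-8,1e-6\ngeneB,0.2,3.3,0.4,0.6\n' > demo-edgeR.csv

# Keep rows whose FDR (column 5) is below 0.05, skipping the header line.
awk -F',' 'NR > 1 && $5 < 0.05' demo-edgeR.csv > demo-significant.csv
cat demo-significant.csv
```

Note this simple split on commas only works because none of the fields themselves contain commas; for quoted CSV you would want csvtool or R instead.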
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and a book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
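The line-to-record arithmetic works because each FASTQ record is exactly four lines, so head must cut on a multiple of 4 to keep records intact. A runnable sketch of the same pattern on a tiny generated file (demo.fq.gz is made-up data):

```shell
# Build a 6-record demo FASTQ (4 lines per record: @name, sequence, +, qualities).
for i in 1 2 3 4 5 6; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done | gzip > demo.fq.gz

# Take the first 2 records = 8 lines, mirroring the head -400000 trick above.
gunzip -c demo.fq.gz | head -8 | gzip > demo-sample.fq.gz

# Record count = line count / 4.
gunzip -c demo-sample.fq.gz | awk 'END { print NR / 4, "records" }' > demo.subset_count
cat demo.subset_count
```

If the count ever comes out non-integer, the subset was cut mid-record and downstream tools will likely complain.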
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) --

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
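The rule of thumb in item 3 is easy to turn into a quick estimate before you provision a volume (the 500-million-read figure below is just an illustrative input, not from this dataset):

```shell
# Back-of-envelope intermediate-disk estimate at ~1 TB per 200 million reads.
reads_millions=500  # hypothetical dataset size
awk -v r="$reads_millions" 'BEGIN { printf "%.1f TB for %d million reads\n", r / 200, r }' > demo.disk_estimate
cat demo.disk_estimate
```

It's only a rule of thumb, so round up generously when choosing a disk size.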
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
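Since df -k reports 1K blocks, converting to GB is a division by 1024^2. A sketch using the numbers from the sample output above (roughly 37 GB of raw capacity, which df-style rounding reports as ~40 GB):

```shell
# df -k reports 1K blocks; divide by 1024^2 (= 1048576) to get GiB.
# The two numbers are the 1K-blocks and Available columns from the sample output above.
awk 'BEGIN { printf "%.1f GB total, %.1f GB free\n", 38565344 / 1048576, 16500940 / 1048576 }' > demo.df_gb
cat demo.df_gb
```

On a live machine, `df -h /mnt` does this conversion for you.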
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 3
Getting started with Amazon EC2
Short version:

• go to http://aws.amazon.com/, log in, then "EC2" (upper left);
• select "Launch instance";
• select "Ubuntu 14.04" from the list;
• select "m3.xlarge" from the list (towards bottom of "General purpose");
• click "Review and launch";
• select "Launch";
• if your first time through, create a key pair; otherwise select existing;
• click "launch instance".
3.1 More details

3.1.1 Start up an EC2 instance

Log in

Go to 'https://aws.amazon.com/' in a Web browser.

Select the 'My Account/Console' menu option, 'AWS Management Console'.

Log in with your username & password.

Click on EC2 (upper left).
Select your zone

Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.

Select the machine operating system to boot

Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size

Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.

Confirm and launch

Review the details (ignore the warnings) and click on Launch.
(First time through) generate a new key pair

If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.

(Next times through) select an existing key pair

Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications/Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running;
• used Ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .

These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
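If you just want a quick peek without paging through the file, remember that a FASTQ record is four lines (an '@' header, the sequence, a '+' separator, and the quality string); this is a minimal sketch using the same file as above:

```shell
# Show just the first FASTQ record (4 lines) without decompressing the whole file.
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -4
```
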
Questions:

• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?

Links:

• FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention --

• '#' lines are comments;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
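If the filename manipulation above looks opaque, here is a minimal sketch of what happens to a single filename (one of ours), using only basename and bash's ${variable/pattern/replacement} substitution:

```shell
# Take one R1 filename and derive the matching R2 filename.
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips the directory part and the given suffix.
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# ${var/pattern/replacement} substitutes the first match (bash-specific).
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```
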
Questions:

• how do you figure out if it's working?
  – copy/paste it from Word
  – put in lots of 'echo' statements
  – edit one line at a time
• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/.}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002.001.dn.fq.gz   6Hour_CGATGT_L002.002.dn.fq.gz
0Hour_ATCACG_L002.002.dn.fq.gz   6Hour_CGATGT_L002.003.dn.fq.gz
0Hour_ATCACG_L002.003.dn.fq.gz   6Hour_CGATGT_L002.004.dn.fq.gz
0Hour_ATCACG_L002.004.dn.fq.gz   6Hour_CGATGT_L002.005.dn.fq.gz
0Hour_ATCACG_L002.005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002.001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
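Since the map is a two-column, tab-separated file (gene ID, then transcript ID), standard UNIX tools can summarize it; this sketch counts the distinct "genes" and transcripts in the assembly:

```shell
# Column 1 = gene IDs, column 2 = transcript IDs; count unique values of each.
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l   # number of "genes"
cut -f2 Trinity.fasta.gene_trans_map | sort -u | wc -l   # number of transcripts
```
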
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: UniRef90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using SignalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database: a pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
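For the "generate a data subset" step, one common trick is to take the first N reads of a gzipped FASTQ file; since each read is four lines, 400,000 lines is 100,000 reads. The input filename below is the example file from the download section above:

```shell
# Keep the first 100,000 reads (= 400,000 lines) as a small test data set.
zcat Ath_Mut_1_R1.fastq.gz | head -400000 | gzip > Ath_Mut_1_R1.subset.fastq.gz
```
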
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
• UT (Austin) Sequencing Core prices - costs and yields for sequencing.

• ANGUS - summer NGS course - lots of resources and materials, and book reference.

• Data Carpentry - intro to R, etc.

• Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
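The 400,000-to-100,000 conversion works because a FASTQ record is exactly four lines. A sketch of the same pipeline on synthetic data (the reads here are made up, standing in for a real SRR file):

```shell
# Build a tiny synthetic FASTQ file: 1000 records of 4 lines each.
for i in $(seq 1 1000); do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
done | gzip > synthetic.fq.gz

# Take the first 400 lines, i.e. 400 / 4 = 100 records:
gunzip -c synthetic.fq.gz | head -400 | gzip > sample.fq.gz
gunzip -c sample.fq.gz | wc -l                       # 400 lines
echo $(( $(gunzip -c sample.fq.gz | wc -l) / 4 ))    # 100 records
```

Because `head` takes lines from the start of the file, the subset always begins on a record boundary as long as you ask for a multiple of 4 lines.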
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) --

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
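A back-of-the-envelope sketch of that rule of thumb (~1 TB of intermediate space per 200 million reads); the read count below is a made-up example, not from the workshop data:

```shell
# Disk budget estimate: round the read count up to whole terabytes.
reads=600000000                               # e.g. 600 million reads total
tb=$(( (reads + 199999999) / 200000000 ))     # ceiling division: 600M reads -> 3 TB
echo "budget about ${tb} TB of intermediate disk"
```

For three samples of 200 million reads each, that suggests provisioning roughly 3 TB before starting.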
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
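If you want to convert the `df -k` figure yourself, the "1K-blocks" column is in kilobytes, so divide by 1024 twice; a quick sketch using the number shown above:

```shell
# Convert df's 1K-blocks column into gigabytes.
blocks_1k=38565344                 # the "1K-blocks" value from df -k /mnt
gb=$(( blocks_1k / 1024 / 1024 ))  # 1 GB = 1024 * 1024 KB
echo "${gb} GB"                    # about 36 GB, which the text rounds to 40 GB
```

`df -h` ("human-readable") does this conversion for you.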
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The GitHub repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Select your zone

Many of the resources that we use are hosted by Amazon on the East coast. Make sure that your dashboard has 'N. Virginia' on the upper right.

Then click on Launch Instance.

Select the machine operating system to boot

Find the "Ubuntu Server 14.04" image in the first list to show up.
Choose the machine size

Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.

Confirm and launch

Review the details (ignore the warnings) and click on Launch.
(First time through) generate a new key pair

If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.

(Next times through) select an existing key pair

Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file

(You only need to do this once for each key.)

Open puttygen; select "Load".
Find and load your 'pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications/Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".
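Concretely, `og-rwx` strips the read/write/execute bits for "group" and "other", leaving the key readable only by you (ssh refuses world-readable keys). A sketch on a stand-in file; `fake.pem` is just a placeholder for your real amazon.pem, and `stat -c` is the GNU/Linux form (on a Mac, the equivalent is `stat -f %A`):

```shell
# Show the effect of chmod og-rwx on a placeholder key file.
touch fake.pem
chmod 644 fake.pem        # world-readable: ssh would reject a key like this
chmod og-rwx fake.pem     # strip all group (g) and other (o) permission bits
stat -c %a fake.pem       # prints 600: owner read/write only
```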
Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running;

• used ubuntu 14.04;

• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder - if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer. How exciting!
4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .

These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

• why do the files have DNA in the name?

• why are there R1 and R2 in the file names?

• why don't we combine all the files?

Links:

• FASTQ Format
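For reference while answering those questions: a FASTQ record is exactly four lines -- an @header, the sequence, a '+' separator, and a quality string the same length as the sequence. A sketch on a synthetic record (not from the Nematostella data):

```shell
# One synthetic FASTQ record; the /1 suffix marks it as an R1 read.
printf '@read1/1\nACGTACGT\n+\nIIIIIIII\n' > example.fq
cat example.fq
wc -l < example.fq    # 4 lines = 1 record
```

This 4-lines-per-record structure is why read counting and subsetting below always work in multiples of four lines.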
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?

• Which is "better", R1 or R2? And why?

Links:

• FastQC

• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming. We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need is the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?

• How do you figure out what parameters to use?

• What adapters do you use?

• What version of Trimmomatic are we using here? (And FastQC?)

• Do you think parameters are different for RNAseq and genomic data sets?

• What's with these annoyingly long and complicated filenames?

• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html

• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:

• is the quality trimmed data "better" than before?

• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --

• '#' indicates a comment;

• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;

• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;

• what's with 'orphans.fq'?
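The filename rewriting in the loop can be tried out on its own, one step at a time. Here is a sketch using one example filename (any file matching the loop's pattern behaves the same way); the substitution syntax is bash's `${variable/pattern/replacement}`:

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# strip the trailing .extract.fastq.gz suffix
base=$(basename "$filename" .extract.fastq.gz)
echo "$base"                  # 0Hour_ATCACG_L002_R1_001

# replace the first occurrence of _R1_ with _R2_
baseR2=${base/_R1_/_R2_}
echo "$baseR2"                # 0Hour_ATCACG_L002_R2_001
```

Running `echo` on the intermediate variables like this is exactly the "put in lots of echo" debugging tactic mentioned below.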
Questions:

• how do you figure out if it's working?

  - copy/paste it from Word
  - put in lots of echo

  - edit one line at a time

• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
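What "interleaved" means here: the R1 and R2 records alternate in a single stream, pair by pair. interleave-reads.py does this with proper pairing checks on real FASTQ; the sketch below just illustrates the record ordering on one synthetic pair (not the workshop data):

```shell
# One synthetic read pair, one 4-line FASTQ record per file.
printf '@r1/1\nAAAA\n+\nIIII\n' > R1.fq
printf '@r1/2\nTTTT\n+\nIIII\n' > R2.fq

# For a single pair, interleaving is: the R1 record, then its R2 mate.
cat R1.fq R2.fq > interleaved.fq
wc -l < interleaved.fq    # 8 lines = one interleaved pair
```

Keeping mates adjacent like this lets the downstream paired-mode tools (e.g. normalize-by-median.py -p) process both reads of a pair together.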
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file, trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 \
     git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The components (TR|c) often contain related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
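The map is two tab-separated columns (gene id, transcript id), so standard UNIX tools can summarize it, e.g. counting isoforms per gene. A sketch on synthetic rows (the ids below are invented Trinity-style names; your real input is Trinity.fasta.gene_trans_map):

```shell
# Build a tiny gene_trans_map-style file: gene id <tab> transcript id.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > map.txt

# Count transcripts (isoforms) per gene: here TR1|c0_g1 has 2, TR2|c0_g1 has 1.
cut -f1 map.txt | sort | uniq -c
```

A long tail of genes with many isoforms is one symptom of the de novo over-splitting discussed in the FAQ above.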
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type 'yes' for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder-2.0.1 TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using signalP;

• transmembrane regions, using tmHMM;

• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate-2.0.2 Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
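The gene/transcript map loaded above is just a plain two-column, tab-separated file mapping each gene to its transcripts. A minimal sketch of that format, with made-up IDs, and a quick way to count the genes in it:

```shell
# toy gene_trans_map: gene <TAB> transcript, one pair per line (IDs are made up)
printf 'TR1\tTR1_i1\nTR1\tTR1_i2\nTR2\tTR2_i1\n' > toy.gene_trans_map

# count the distinct genes in column 1
cut -f1 toy.gene_trans_map | sort -u | wc -l
```

The same `cut | sort -u | wc -l` pattern works on the real Trinity.fasta.gene_trans_map if you want a quick gene count for your assembly.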
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let's see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
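If csvtool isn't available on your system, the same two-column extraction can be sketched with awk. Here's a self-contained example on a toy tab-separated file; the column positions (target_id in column 2, est_counts in column 7) follow the results.xprs header shown above, but double-check them against your own output:

```shell
# toy results.xprs-like file: target_id in column 2, est_counts in column 7
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > toy.xprs
printf '2\tnemaid1.tr1\t811\t508.14\t1301\t45\t158.34\n' >> toy.xprs

# skip the header row, keep only the two columns we need
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' toy.xprs
```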
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It contains the gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
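Once you have the CSV, a quick way to pull out genes passing an FDR threshold is a one-line awk filter. This is a sketch on a toy file; the column names and order here are illustrative, so check them against the header of your actual nema-edgeR.csv first:

```shell
# toy differential expression table (columns are illustrative)
cat > toy-edgeR.csv <<'EOF'
gene,logFC,PValue,FDR
geneA,2.5,0.001,0.004
geneB,0.3,0.400,0.550
EOF

# keep genes with FDR < 0.05 (column 4), skipping the header
awk -F, 'NR > 1 && $4 + 0 < 0.05 { print $1 }' toy-edgeR.csv   # prints geneA
```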
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
111 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
112 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
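The combined URL is just the standard ftp://user:password@host form, so you can assemble it yourself from the three pieces the e-mail gives you:

```shell
# build an FTP URL from the example credentials above
username=rnaseqmodel
password=QecheJa6
host=titan.bch.msu.edu
echo "ftp://${username}:${password}@${host}/"   # prints ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/
```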
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
113 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
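The tips about tracking commands and moving from interactive runs to a script boil down to two habits: echo your progress, and stop at the first failure. A minimal bash skeleton showing both (the data and steps here are toy placeholders, not a real pipeline):

```shell
#!/bin/bash
# fail on the first error, including failures inside pipes
set -e
set -o pipefail

echo "step 1: make toy input data"
printf 'line %d\n' 1 2 3 > data.txt

echo "step 2: compress it"
gzip -c data.txt > data.txt.gz

echo "step 3: verify the round trip"
gunzip -c data.txt.gz | cmp - data.txt && echo "ok"
```

With `set -e` in place, a failed step kills the script immediately instead of letting later steps run on missing or truncated files.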
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
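The 4:1 line-to-record ratio is why the head count must be a multiple of 4; here's the same idea demonstrated on a tiny toy FASTQ:

```shell
# make a toy FASTQ with 3 records (12 lines); each record is exactly 4 lines
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 > toy.fastq

# keep the first 2 records = first 8 lines
head -8 toy.fastq > sub.fastq

# lines / 4 = number of records
echo $(( $(wc -l < sub.fastq) / 4 ))   # prints 2
```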
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
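That rule of thumb (1 TB per 200 million reads, rounded up) can be sketched as shell arithmetic; the read count below is a made-up example:

```shell
# estimate intermediate disk for a hypothetical 600 million read dataset
reads_millions=600
tb_needed=$(( (reads_millions + 199) / 200 ))   # integer ceiling of reads/200
echo "${tb_needed} TB"   # prints 3 TB
```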
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
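If you want the available space as a number (say, inside a script that refuses to start without enough room), you can parse the df output; a sketch using the example output above:

```shell
# parse 'df -k' style output: field 4 of the data row is available 1K-blocks
df_output='Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt'

echo "$df_output" | awk 'NR == 2 { printf "%.1f GB available\n", $4 / 1024 / 1024 }'
# prints: 15.7 GB available
```

On a live machine you would pipe `df -k /mnt` straight into the awk command instead of using a saved string.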
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Choose the machine size
Select 'General purpose', 'm3.xlarge', and then 'Review and Launch'.
Confirm and launch
Review the details (ignore the warnings) and click on Launch
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key)
Open puttygen, and select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty and enter your hostname into the Host Name box
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications/Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running
• used ubuntu 14.04
• NOT a micro instance (m3.xlarge or bigger)
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
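You can check what chmod u-w did by looking at the permission string in a listing; a quick sketch on a throwaway file:

```shell
# make a throwaway file, then remove the owner's write bit
touch demo.txt
chmod u-w demo.txt

# the first four characters of the listing should now read '-r--'
ls -l demo.txt | cut -c1-4
```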
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?
Links:

• FASTQ Format
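As a quick reminder of what's inside these files, each FASTQ record is exactly four lines: an @-prefixed header, the sequence, a '+' separator, and the quality string. A toy record (made-up read):

```shell
# one toy FASTQ record: header, sequence, separator, quality scores
cat > record.fastq <<'EOF'
@read1/1
ACGTACGT
+
IIIIIIII
EOF

# a well-formed FASTQ file always has a multiple of 4 lines
wc -l < record.fastq
```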
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?
Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
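Those percentages are just each count divided by the input pairs; you can sanity-check them with a little awk arithmetic:

```shell
# recompute the 'Both Surviving' percentage from the counts in the log above
awk 'BEGIN { printf "%.2f%%\n", 138775 / 140557 * 100 }'   # prints 98.73%
```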
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:

• is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files, and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line: the ability to automate such tasks.

Here's a for loop that you can run; we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention:

* lines beginning with '#' are comments
* anywhere you see a '$', it is replaced by the value of the variable after it; so, e.g., $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
* what's with 'orphans.fq'?
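To see what those filename manipulations do on their own, here is a small stand-alone demo you can paste into a bash prompt (the filename is made up for illustration; nothing is downloaded or trimmed):

```shell
# a hypothetical R1 filename, just for illustration
filename=0Hour_ATCACG_L002_R1_002.extract.fastq.gz

# basename strips a trailing suffix from a filename
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_002

# ${variable/pattern/replacement} is bash string substitution;
# it turns the R1 name into the matching R2 name
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_002
```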
Questions:

* how do you figure out if it's working?
  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time
* how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again; you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
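If you are wondering what "interleaved" actually looks like: the R1 and R2 records simply alternate in one file, four lines at a time. Here is a tiny self-contained sketch of the idea using only standard shell tools (the reads are invented; the real pipeline uses khmer's interleave-reads.py, which also sanity-checks the pairing):

```shell
# make a one-record R1 file and its R2 mate (fake reads, for illustration)
printf '@read1/1\nACGT\n+\nIIII\n' > r1.fq
printf '@read1/2\nTGCA\n+\nJJJJ\n' > r2.fq

# 'paste - - - -' folds each 4-line FASTQ record onto a single line
paste - - - - < r1.fq > r1.flat
paste - - - - < r2.fq > r2.flat

# joining the flattened files line-by-line and splitting the tabs back out
# gives R1 record, then R2 record, alternating: the interleaved layout
paste r1.flat r2.flat | tr '\t' '\n' > interleaved.fq

head -8 interleaved.fq
```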
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data. Known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step:

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe.* *.se.* *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

* The different i's that correspond to the same g represent isoforms.
* The different g's could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

* signal peptides: using signalP
* transmembrane regions: using tmHMM
* rRNA transcripts: using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4    269   0.000000      0     0  0.000000     0.000000      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1    811   508.137307    1301  45  158.338092   252.711602    4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1    790   487.144836    1845  356 1218.927626  1976.727972   1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1    852   549.122606    1792  3   871.770849   1352.610064   5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1    675   372.190166    1005  20  88.963433    161.343106    2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210   9921  9825 9919.902997 11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406   103.720396    360   270 271.097003   1061.173959   1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447   144.526787    6     0   0.000000     0.000000      2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8   2075  1771.684102   2782  58  958.636395   1122.756883   1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
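If csvtool is not available, the same step (pull two named columns out of a tab-separated file, minus the header) can be sketched with awk. The two-column input file below is invented for the demo; real results.xprs files have many more columns:

```shell
# toy stand-in for a results.xprs file (tab-separated, with a header row)
printf 'bundle_id\ttarget_id\test_counts\n1\tnemaid1tr1\t158.338092\n' > demo.xprs

# read the header to learn each column's position, then print just the
# target_id and est_counts columns, skipping the header row
awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
            { print $col["target_id"] "\t" $col["est_counts"] }' demo.xprs
```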
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one, diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

* False Discovery Rate
* Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
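The pattern behind that link is just ftp://username:password@hostname. As a quick sanity check, you can assemble it yourself from the pieces in the e-mail (using the demo credentials above):

```shell
# demo credentials, copied from the example e-mail above
username=rnaseqmodel
password=QecheJa6
host=titan.bch.msu.edu

# combine them into a single browser-friendly FTP URL
url="ftp://${username}:${password}@${host}"
echo $url    # ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
```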
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget', for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. All the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions; the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

* close your terminal window;
* open up a new one and connect into your Amazon machine;
* type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
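The arithmetic works because each FASTQ record is exactly four lines, so head -400000 keeps exactly 400000/4 = 100,000 complete records. Here is a miniature, runnable version of the same trick with made-up reads:

```shell
# build a fake FASTQ file with 3 records (12 lines)
for i in 1 2 3
do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > tiny.fq

# keep the first 8 lines, i.e. the first 8/4 = 2 complete records
head -8 tiny.fq > sub.fq

# count the records that survived
grep -c '^@read' sub.fq    # prints 2
```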
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
(First time through) generate a new key pair
If you don't have any key pairs, enter a key pair name and then download a key pair. Then click Launch Instance.
(Next times through) select an existing key pair
Select a key pair and click 'Launch'.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file
(You only need to do this once for each key)
Open puttygen; select "Load".
Find and load your 'pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.
Load it
Now "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications > Utilities) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
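If you tire of typing that long hostname, an entry in ~/.ssh/config lets you shorten the whole command to `ssh amazon`. This is an optional convenience, not part of the workshop instructions; the hostname below is the example from above and will differ for your instance:

```
Host amazon
    HostName ec2-174-129-122-189.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/Desktop/amazon.pem
```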
At the end, you should see text and a prompt that look like this:
A final checklist:
- EC2 instance is running
- used ubuntu 14.04
- NOT a micro instance (m3.xlarge or bigger)
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder -- if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable -- by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
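You can verify that the write protection took effect on a throwaway file rather than your data (a sketch; 'precious.txt' is a made-up name):

```shell
touch precious.txt
chmod u-w precious.txt                     # drop the owner's write bit

# the permission string should now start with -r-- for the owner
perms=$(ls -l precious.txt | cut -c1-4)
echo "$perms"                              # prints -r--

chmod u+w precious.txt && rm precious.txt  # tidy up the throwaway file
```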
We'll talk about what these files are below.
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .

These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
- why do the files have DNA in the name?
- why are there R1 and R2 in the file names?
- why don't we combine all the files?
Links:
- FASTQ Format
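For reference, each FASTQ record is exactly four lines: a '@' header, the called bases, a '+' separator, and one quality character per base. A toy record (entirely made up) and a record counter:

```shell
# one made-up FASTQ record
cat > toy.fq <<'EOF'
@HWI-ST333:42:example 1:N:0:ATCACG
GGTTACAGACGTAC
+
IIIIIHHHHGGGGF
EOF

# a well-formed FASTQ file has a line count divisible by 4;
# the record count is simply lines / 4
awk 'END { print NR/4 }' toy.fq     # prints 1
```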
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer -- that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
18 Chapter 4 Short read quality and trimming
labibi Documentation Release 10
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now -- you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
- 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:
- What should you pay attention to in the FastQC report?
- Which is "better", R1 or R2? And why?
Links:
- FastQC
- FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
- Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
- Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
- 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:
- Is the quality-trimmed data "better" than before?
- Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.
Here's a for loop that you can run -- we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --
- '#' starts a comment;
- anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
- we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
- what's with 'orphans.fq'?
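The ${base/_R1_/_R2_} construct is plain bash pattern substitution, and basename's second argument strips a trailing suffix; both are worth trying on their own before trusting them inside the loop:

```shell
base="0Hour_ATCACG_L002_R1_001"

# replace the first occurrence of _R1_ with _R2_
echo "${base/_R1_/_R2_}"        # prints 0Hour_ATCACG_L002_R2_001

# basename strips the directory part and, if given, a trailing suffix
basename /mnt/data/${base}.extract.fastq.gz .extract.fastq.gz
```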
Questions:
- how do you figure out if it's working?
  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time
- how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again -- you might notice this is only a minor modification of the previous for loop.
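Interleaved form simply alternates one four-line R1 record with its matching four-line R2 record. The real work is done by khmer's interleave-reads.py; the toy sketch below mimics it with paste and awk on a made-up single-record pair, just to show the output shape:

```shell
# two made-up mate files, one record each
printf '@read1/1\nACGT\n+\nIIII\n' > r1.fq
printf '@read1/2\nTGCA\n+\nIIII\n' > r2.fq

# fold each file to one line per record, pair the records, unfold again
paste -d '\n' \
    <(awk 'ORS=(NR%4?"\t":"\n")' r1.fq) \
    <(awk 'ORS=(NR%4?"\t":"\n")' r2.fq) | tr '\t' '\n' > interleaved.fq

cat interleaved.fq    # the R1 record then the R2 record, 8 lines total
```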
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     # interleave the pair and compress the result
     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done
gzip orphans.fq

Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
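As a loose cartoon of the idea (the real algorithm estimates each read's coverage from median k-mer abundance in a Count-Min sketch; this one-liner just caps identical reads), imagine keeping at most C copies of anything you have already seen:

```shell
# made-up reads: one sequence observed 5 times, another once
printf 'ACGTACGT\n%.0s' 1 2 3 4 5 > reads.txt
printf 'TTGGCCAA\n' >> reads.txt

# keep each distinct line at most C=3 times -- a cartoon of coverage capping
awk 'count[$0]++ < 3' reads.txt    # 3 copies of ACGTACGT, 1 of TTGGCCAA
```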
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on the effects of k-mer trimming.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --
0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/
cd trinity
make

Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 \
     git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes and isoforms:
- The different (i's) that correspond to the same (g) represent isoforms.
- The different (g's) could represent different genes (or parts of genes).
- The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
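The map is two tab-separated columns, gene then transcript, so standard text tools can answer quick questions about it. A sketch with made-up Trinity-style identifiers:

```shell
# toy map: gene g1 has two isoforms, gene g2 has one
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g2\tTR2|c0_g2_i1\n' > toy.gene_trans_map

# count transcripts (isoforms) per gene
cut -f1 toy.gene_trans_map | sort | uniq -c
```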
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.
(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
- signal peptides, using signalP
- transmembrane regions, using tmHMM
- rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) \
    > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) \
    > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) \
    > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) \
    > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id       length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nema.id7.tr4    269     0.000000     0            0           0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nema.id1.tr1    811     508.137307   1301         45          158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nema.id2.tr1    790     487.144836   1845         356         1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nema.id3.tr1    852     549.122606   1792         3           871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nema.id4.tr1    675     372.190166   1005         20          88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nema.id62.tr13  2150    1846.657210  9921         9825        9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nema.id63.tr13  406     103.720396   360          270         271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nema.id61.tr13  447     144.526787   6            0           0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nema.id21.tr8   2075    1771.684102  2782         58          958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
9.4 Differential Expression
First, let's install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log-scaled counts per million (logCPM), P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
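Once the CSV exists, you can do quick follow-up filtering at the command line. Here's a sketch that pulls out genes passing an FDR cutoff with awk — the toy-edgeR.csv file and its column order are invented for illustration, so check the header of your real nema-edgeR.csv before adapting it:

```shell
# Toy stand-in for nema-edgeR.csv: gene, logFC, logCPM, PValue, FDR (hypothetical layout)
cat > toy-edgeR.csv <<'EOF'
"gene","logFC","logCPM","PValue","FDR"
"nema.id1.tr1",2.31,5.2,0.0001,0.002
"nema.id2.tr1",-0.12,3.1,0.6,0.8
"nema.id3.tr1",-3.02,6.7,0.00005,0.0015
EOF

# keep genes with FDR < 0.05 (column 5); tail skips the header line
tail -n +2 toy-edgeR.csv | awk -F, '$5 < 0.05 { print $1 }' > significant-genes.txt
cat significant-genes.txt
```

Here two of the three toy genes pass the cutoff.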
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
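The wget lines above all use the standard 'ftp://user:password@host/path' URL scheme; here's a tiny sketch of how the pieces from the e-mail combine (using the example credentials above):

```shell
hostname=titan.bch.msu.edu
username=rnaseqmodel
password=QecheJa6

# ftp://<username>:<password>@<hostname>/<path>
url="ftp://${username}:${password}@${hostname}/testdata/"
echo $url    # ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
```

Swap in whatever hostname, username, and password your own sequencing core sends you.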
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
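For the qsub step, a minimal job script can be as small as the sketch below. The resource requests, the module name, and the pipeline.sh filename are all placeholders — adjust them for your HPC and for the script you actually tested:

```shell
# write a minimal qsub (PBS) wrapper around the already-tested script;
# resource requests, module name, and pipeline.sh are placeholders
cat > run-pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N mrnaseq-test

cd $PBS_O_WORKDIR          # qsub starts jobs in $HOME; return to the submit directory
module load trimmomatic    # load whatever modules your commands need

bash pipeline.sh           # the script you tested interactively on small data
EOF

# submit it with: qsub run-pipeline.qsub
```

Run it on the small data first, exactly as the checklist says, before scaling up.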
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
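For example, if quality collapses after base 70, one blunt option is to crop every read at that position (Trimmomatic's CROP:70 step does this). The sketch below illustrates the same idea with awk on a tiny made-up FASTQ, truncating the sequence and quality lines to 70 bases:

```shell
# build a toy FASTQ with two fake 100 bp reads
seq100=$(printf 'A%.0s' $(seq 1 100))
qual100=$(printf 'I%.0s' $(seq 1 100))
printf '@read1\n%s\n+\n%s\n@read2\n%s\n+\n%s\n' "$seq100" "$qual100" "$seq100" "$qual100" > toy.fq

# crop the sequence and quality lines (the even-numbered lines) to 70 bases
awk 'NR % 2 == 0 { print substr($0, 1, 70); next } { print }' toy.fq > toy.crop70.fq
```

This relies on each FASTQ record being exactly four lines, with sequence and quality on the even-numbered lines; for real data, use Trimmomatic rather than hand-rolled awk.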
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon. (See Tips and Tricks for working with Remote Computers.)

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
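A quick sanity check on any subset is to count records: each FASTQ record is exactly four lines, so the line count divided by four should match what you expect. A sketch on a small generated file:

```shell
# build a fake 10-record FASTQ, then subset the first 8 lines (= 2 records)
for i in $(seq 1 10)
do
    printf '@read%d\nACGT\n+\nIIII\n' $i
done | gzip > fake.fq.gz

gunzip -c fake.fq.gz | head -8 | gzip > subset.fq.gz

# lines / 4 = number of FASTQ records
echo $(( $(gunzip -c subset.fq.gz | wc -l) / 4 ))    # 2
```

The same line count check works on the real sample.fq.gz (400,000 / 4 = 100,000 records).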
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
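The 200-million-reads-per-TB rule of thumb is easy to turn into a quick estimate with shell arithmetic — a sketch, assuming the rule scales linearly for your data:

```shell
reads=500000000    # e.g. 500 million reads across all of your samples

# 1 TB per 200 million reads, rounded up to a whole number of TB
tb=$(( (reads + 199999999) / 200000000 ))
echo "budget about ${tb} TB of intermediate disk space"    # 3 TB for 500m reads
```

Plug in your own total read count before provisioning the EBS volume.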
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Click on View Instances
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)

First, Download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Generate a ppk file from your pem file.

(You only need to do this once for each key.)

Open puttygen, and select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note, you have to select 'All files' on the bottom.

Load it.

Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note, you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running
• used ubuntu 14.04
• NOT a micro instance (m3.xlarge or bigger)

3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
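The "virtual copy" made by ln -fs is just a set of symbolic links — pointers back to the files under /mnt/data, so nothing gets duplicated on disk. A tiny sketch of the idea in a scratch directory:

```shell
mkdir -p /tmp/linkdemo/data /tmp/linkdemo/work
echo "some data" > /tmp/linkdemo/data/reads.fq

cd /tmp/linkdemo/work
ln -fs ../data/reads.fq .

# the link reads like the original file; readlink shows where it points
cat reads.fq         # some data
readlink reads.fq    # ../data/reads.fq
```

Because the links are read-through pointers, the chmod u-w we did on the originals still protects them from accidental edits in the work directory.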
Questions:

• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?

Links:

• FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention –

• # are comments
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about
• what's with 'orphans.fq'?
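The filename surgery in the loop is plain bash: basename strips a known suffix, and ${var/pattern/replacement} rewrites part of a string. A standalone sketch:

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips the trailing suffix, leaving the sample/lane/read part
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# ${var/pattern/replacement} swaps _R1_ for _R2_ to get the mate's name
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```

Note that ${var/pattern/replacement} is a bash feature, so run these loops under bash rather than plain sh.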
Questions:

• how do you figure out if it's working?

– copy/paste it from Word
– put in lots of echo
– edit one line at a time

• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/.}pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
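As a toy illustration of the idea only (real diginorm estimates read coverage from the median k-mer abundance, not from exact duplicates), this sketch caps identical sequences at a "coverage" of C=2:

```shell
# six reads: 'ACGTACGT' appears four times, 'TTTTCCCC' twice
printf 'ACGTACGT\nACGTACGT\nTTTTCCCC\nACGTACGT\nTTTTCCCC\nACGTACGT\n' > reads.txt

# keep each distinct sequence at most C=2 times, preserving input order
awk -v C=2 'count[$0]++ < C' reads.txt > reads.keep.txt

wc -l < reads.keep.txt    # 4
```

The high-abundance read is thinned from four copies to two while the lower-abundance read is untouched — the same shape of effect normalize-by-median.py has, at k-mer resolution, on real data.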
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002.001.dn.fq.gz   6Hour_CGATGT_L002.002.dn.fq.gz
0Hour_ATCACG_L002.002.dn.fq.gz   6Hour_CGATGT_L002.003.dn.fq.gz
0Hour_ATCACG_L002.003.dn.fq.gz   6Hour_CGATGT_L002.004.dn.fq.gz
0Hour_ATCACG_L002.004.dn.fq.gz   6Hour_CGATGT_L002.005.dn.fq.gz
0Hour_ATCACG_L002.005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002.001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
81 Identify the GeneTranscript relationships
we can generate this file like so
cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The components (TR|c) often contain related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
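You can explore the gene/isoform structure of the map with ordinary shell tools. Here's a small sketch run on a tiny made-up gene_trans_map (the identifiers are illustrative, not real Trinity output); the same one-liner works on the real Trinity.fasta.gene_trans_map:

```shell
# Tiny made-up gene_trans_map: gene<TAB>transcript, one pair per line
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > demo_gene_trans_map

# Column 1 is the gene id, so counting repeats gives isoforms per gene
cut -f1 demo_gene_trans_map | sort | uniq -c | sort -rn > isoforms_per_gene.txt
cat isoforms_per_gene.txt
```

On real data this gives you a quick feel for how many multi-isoform genes Trinity reported.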
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder-2.0.1 TransDecoder
cd TransDecoder
make
Now we can run TransDecoder to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases.
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database, and rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastx.outfmt6
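The .outfmt6 files are plain tab-separated tables with twelve columns (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). Even with -max_target_seqs 1 you can get several HSP lines per query, so here is a sketch (run on made-up rows, not real BLAST output) that keeps only the best-scoring line per query:

```shell
# Made-up outfmt 6 rows: two hits for nema.id1, one for nema.id2
printf 'nema.id1\tsp|P1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n' >  demo.outfmt6
printf 'nema.id1\tsp|P2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120\n' >> demo.outfmt6
printf 'nema.id2\tsp|P3\t85.0\t80\t12\t1\t1\t80\t5\t84\t1e-20\t90\n'     >> demo.outfmt6

# Sort by query id, then by bit score (column 12, descending);
# awk keeps the first line it sees for each query id
sort -k1,1 -k12,12gr demo.outfmt6 | awk '!seen[$1]++' > best_hits.outfmt6
```

The same pipeline works on blastx.outfmt6 or uniref90.blastx.outfmt6 to get one best hit per transcript.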
8.4 Characterization of functional annotation features
1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
• signal peptides, using SignalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate-2.0.2 Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id  target_id       length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nema.id7.tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nema.id1.tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nema.id2.tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nema.id3.tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nema.id4.tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nema.id62.tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nema.id63.tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nema.id61.tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nema.id21.tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
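One quick sanity check on any results.xprs file: TPM is defined so that the tpm column sums to 1,000,000 over the whole file. The sketch below runs on a cut-down, made-up stand-in file that keeps only three of the real columns; on the real file, tpm is the 15th column, so you would use $15 instead of $3:

```shell
# Cut-down, made-up stand-in for results.xprs, keeping three of its columns
printf 'target_id\test_counts\ttpm\n'   >  demo.xprs
printf 'nema.id1.tr1\t158.3\t400000\n'  >> demo.xprs
printf 'nema.id2.tr1\t1218.9\t600000\n' >> demo.xprs

# Skip the header and sum the tpm column; expect ~1e6 on a complete file
awk -F'\t' 'NR > 1 {sum += $3} END {print sum}' demo.xprs
```

If the sum is far from a million on real output, something went wrong upstream (truncated file, wrong column, etc.).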
9.4 Differential Expression
First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log-CPM, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
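Once you have the CSV, you can pull out the significant genes without opening Excel. This sketch uses a made-up two-gene file whose columns (gene, logFC, logCPM, PValue, FDR) follow edgeR's usual output; treat the names and values as illustrative only:

```shell
# Made-up edgeR-style CSV: gene,logFC,logCPM,PValue,FDR
printf 'gene,logFC,logCPM,PValue,FDR\n' >  demo-edgeR.csv
printf 'nema.id1,2.5,6.1,1e-08,1e-06\n' >> demo-edgeR.csv
printf 'nema.id2,0.3,4.2,0.4,0.6\n'     >> demo-edgeR.csv

# Keep the header plus rows with FDR (column 5) below 0.05
awk -F, 'NR == 1 || $5 < 0.05' demo-edgeR.csv > significant.csv
```

Swap in nema-edgeR.csv (and the FDR column position, if it differs) for the real analysis.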
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
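Bowtie2 computes that percentage from the SAM FLAG field (bit 0x4 set means unmapped). You can reproduce the same arithmetic yourself; the fragment below builds a made-up three-read SAM file, of which two reads map:

```shell
# Made-up 3-read SAM fragment; column 2 is the FLAG, 4 = unmapped
printf 'r1\t0\tctg1\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n'  >  demo.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'       >> demo.sam
printf 'r3\t16\tctg1\t5\t42\t4M\t*\t0\t0\tACGT\tIIII\n' >> demo.sam

# Overall alignment rate = mapped reads / all reads; skip @-header lines.
# int($2/4) % 2 tests the 0x4 (unmapped) bit without gawk's and().
awk '$1 !~ /^@/ {n++; if (int($2/4) % 2 == 0) m++}
     END {printf "%.2f%% overall alignment rate\n", 100*m/n}' demo.sam
```

Run against a real SAM file from bowtie/bowtie2, this should agree with the rate the aligner prints.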
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;).
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
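Because a FASTQ record is always exactly four lines, you can verify a subset by counting lines and dividing by four. Here's a self-contained version of the same idea, with a made-up two-record FASTQ standing in for the real file:

```shell
# Make a tiny fake FASTQ (2 records = 8 lines) and compress it
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > demo.fq.gz

# Subset: keep the first 4 lines (= 1 record), then re-compress
gunzip -c demo.fq.gz | head -4 | gzip > subset.fq.gz

# Count records: total lines / 4
echo $(( $(gunzip -c subset.fq.gz | wc -l) / 4 ))
```

Always subset in multiples of four lines, or you'll truncate a record in the middle.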
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really, any cloud machine) --
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt
You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Select the public DNS name for later use
3.1.2 Logging into your new instance "in the cloud" (Windows version)
First, Download Putty and Puttygen from here: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Generate a ppk file from your pem file:
(You only need to do this once for each key.)
Open puttygen; select "Load".
Find and load your '.pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty:
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory!
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications > Utilities) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".
Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the screenshot.
At the end, you should see text and a prompt that look like this:

A final checklist:
• EC2 instance is running;
• used ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.
Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *
We'll talk about what these files are below.

4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in --

ln -fs /mnt/data/* .
These are FASTQ files -- let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?
Links:
• FASTQ Format
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type lsquolsrsquo
ls -d fastqc
to list the files and you should see
0Hour_ATCACG_L002_R1_001extract_fastqc0Hour_ATCACG_L002_R1_001extract_fastqczip0Hour_ATCACG_L002_R2_001extract_fastqc0Hour_ATCACG_L002_R2_001extract_fastqczip
We are not going to show you how to look at these files right now - you need to copy them to your local computer todo that Wersquoll show you that tomorrow But we can show you what they look like because Irsquove made copiesd of themfor you
bull 0Hour_ATCACG_L002_R1_001extract_fastqcfastqc_reporthtml
bull 0Hour_ATCACG_L002_R2_001extract_fastqcfastqc_reporthtml
Questions:

* What should you pay attention to in the FastQC report?
* Which is "better", R1 or R2? And why?

Links:

* FastQC
* FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
    0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se 0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

* Trimmomatic
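Trimmomatic detected "phred33" encoding above: each quality character encodes (ASCII code - 33) as the Phred score. You can check a single character from the shell; here we assume the character 'I', which is common in Illumina 1.8+ data:

```shell
# ASCII code of the quality character 'I'
code=$(printf '%d' "'I")

# subtract the phred33 offset of 33 to get the quality score
echo $(( code - 33 ))    # 'I' is ASCII 73, so this prints 40
```

Q40 corresponds to a 1-in-10,000 estimated error probability, which is why runs of 'I' characters are a good sign in these files.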
4.7 4. FastQC again
Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

* Is the quality-trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run -- we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --

* # are comments;
* anywhere you see a '$' is replaced by the value of the variable after it, so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
* what's with 'orphans.fq'??

Questions:

* How do you figure out if it's working?

  - copy/paste it from Word?
  - put in lots of echo?
  - edit one line at a time?

* How on earth do you figure out how to do this??
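The ${base/_R1_/_R2_} construct used in the loop above is bash pattern substitution: it expands to the value of $base with the first _R1_ replaced by _R2_. A standalone sketch:

```shell
base=0Hour_ATCACG_L002_R1_001

# replace the first occurrence of _R1_ with _R2_
baseR2=${base/_R1_/_R2_}

echo $baseR2    # prints 0Hour_ATCACG_L002_R2_001
```

This is why the loop only globs the R1 files: each R2 filename is derived from its mate rather than matched separately, so the pairs always stay together.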
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again -- you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done
gzip orphans.fq
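interleave-reads.py alternates R1 and R2 records (with pairing checks) into a single stream. The core idea can be sketched with paste and tr on two made-up one-read files:

```shell
# two made-up mates; real files would have many records each
printf '@r1/1\nACGT\n+\nIIII\n' > r1.fq
printf '@r1/2\nTGCA\n+\nIIII\n' > r2.fq

# collapse each 4-line record onto one tab-joined line
paste - - - - < r1.fq > r1.tmp
paste - - - - < r2.fq > r2.tmp

# alternate records from the two files, then restore the line breaks
paste -d'\n' r1.tmp r2.tmp | tr '\t' '\n'
```

The output is the R1 record followed immediately by its R2 mate, for each pair in turn -- the layout the downstream khmer tools expect.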
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
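To get a feel for the idea: diginorm streams through the reads, estimates each read's coverage from what it has already seen, and discards the read if that estimate exceeds a cutoff C. A drastically simplified sketch -- counting whole read sequences instead of khmer's median k-mer abundance, with C=2 and made-up input:

```shell
# keep at most C=2 copies of each distinct sequence; a toy stand-in
# for khmer's median k-mer coverage estimate
printf 'AAAA\nAAAA\nAAAA\nCCCC\n' |
    awk -v C=2 '{ if (++count[$0] <= C) print }'
```

The third AAAA is dropped while the rare CCCC is kept -- which is the point of normalization: redundancy goes down, but low-abundance information is retained.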
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002_001.dn.fq.gz    6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz    6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz    6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz    6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work

for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
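assemstats3.py reports contig count and total bp above a length cutoff. That core computation can be sketched with awk over any FASTA; here, a made-up two-contig file with the same 300 bp cutoff:

```shell
# made-up FASTA: one 350 bp contig and one 4 bp contig
printf '>contig1\n%s\n>contig2\nACGT\n' "$(printf 'A%.0s' $(seq 1 350))" > toy.fa

# count sequences >= 300 bp and sum their lengths
awk '/^>/ { if (len >= 300) { n++; bp += len }; len = 0; next }
           { len += length($0) }
     END   { if (len >= 300) { n++; bp += len }; print n, bp }' toy.fa    # prints 1 350
```

Only the 350 bp contig passes the cutoff, so the count is 1 and the total is 350 bp -- short junk contigs are excluded, as in the stats above.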
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c...) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
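The map is two tab-separated columns -- gene identifier, then transcript identifier -- so counting isoforms per gene is a one-liner. A sketch with a made-up three-line map (the identifiers are illustrative, not real Trinity output):

```shell
# made-up map: gene <TAB> transcript
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > toy.gene_trans_map

# number of transcripts (isoforms) recorded for each gene
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' toy.gene_trans_map | sort
```

Here the first gene has two isoforms and the second has one -- running the same one-liner on the real Trinity.fasta.gene_trans_map shows how isoform-rich your assembly is.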
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all the interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastx.outfmt6
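-outfmt 6 rows are 12 tab-separated columns: query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, and bit score. BLAST groups a query's hits together, best first, so keeping only the top line per query is a common post-processing step. A sketch with two made-up rows:

```shell
# two made-up hits for the same query (the IDs are illustrative)
printf 'nema.id1\tsp|P1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n' >  toy.outfmt6
printf 'nema.id1\tsp|P2\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-20\t120\n'   >> toy.outfmt6

# keep only the first (best) line for each query ID
awk -F'\t' '!seen[$1]++' toy.outfmt6 | cut -f1,2
```

`!seen[$1]++` is the classic awk idiom for "first occurrence only": it prints a line the first time column 1 appears and suppresses later repeats.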
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

* signal peptides, using SignalP
* transmembrane regions, using tmHMM
* rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
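results.xprs is plain tab-separated text with a header row, so you can pull out columns directly; the last column is TPM. A sketch with a miniature made-up stand-in for the real file (three columns here, versus fifteen in the real output):

```shell
# tiny stand-in for results.xprs: header plus one data row
printf 'bundle_id\ttarget_id\ttpm\n1\tnemaid1tr1\t4.695471e+03\n' > toy.xprs

# skip the header, print target_id with its tpm
awk -F'\t' 'NR > 1 { print $2, $3 }' toy.xprs    # prints nemaid1tr1 4.695471e+03
```

On the real file you would select the actual column positions of target_id and tpm, but the NR > 1 header-skipping pattern is the same.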
9.4 Differential Expression

First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs, and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

* False Discovery Rate
* Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description, and a set of tools, for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
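The combined link is just ftp://<username>:<password>@<hostname>; a sketch that assembles it from the pieces above:

```shell
hostname=titan.bch.msu.edu
username=rnaseqmodel
password=QecheJa6

# combine the credentials and host into a single browsable URL
echo "ftp://${username}:${password}@${hostname}"    # prints ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
```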
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
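For tip 2, a subset is usually just a head on the decompressed stream, keeping a multiple of 4 lines so no FASTQ record is cut in half. A sketch with a made-up file standing in for real reads:

```shell
# toy gzipped file of 4000 lines (i.e. 1000 pretend 4-line records)
seq 4000 | gzip > big.fq.gz

# keep the first 100 records (400 lines) as a working subset
zcat big.fq.gz | head -400 | gzip > subset.fq.gz

zcat subset.fq.gz | wc -l    # prints 400
```

Because the cut falls on a record boundary, every downstream tool sees well-formed reads; only the depth changes.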
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices -- costs and yields for sequencing.

ANGUS - summer NGS course -- lots of resources and materials, and book reference.

Data Carpentry -- intro to R, etc.

Software Carpentry -- more scripting, Python, etc.

12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms -- but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. -- all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options:
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
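For example, the command often ends up looking something like the sketch below. The URL, username, and password here are invented placeholders - your sequencing core supplies the real values:

```shell
# Hypothetical URL and credentials -- substitute what your sequencing core gives you.
# -u passes username:password, -L follows redirects, -O keeps the remote filename.
URL="http://seqcore.example.edu/runs/Sample1_R1.fastq.gz"
CMD="curl -u myuser:mypass -O -L $URL"
echo "$CMD"    # inspect the command first; when it looks right, run it with: eval "$CMD"
```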
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
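As a sanity check, remember that a FASTQ record is four lines, so you can count the records in any .fq.gz file by dividing its line count by four. A minimal sketch, using a fabricated two-record file (tiny.fq.gz is made up for the demo) rather than real data:

```shell
# Make a tiny two-record FASTQ (8 lines) standing in for real data:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | gzip > tiny.fq.gz

# Count records: total line count divided by 4
lines=$(gunzip -c tiny.fq.gz | wc -l)
echo $((lines / 4))    # prints 2
```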
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine):
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
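That last rule of thumb is easy to turn into a quick back-of-the-envelope estimate (the read count below is a made-up example, not from this workshop's data):

```shell
# ~1 TB of intermediate disk per 200 million reads; shell arithmetic truncates,
# so use ceiling division to round up rather than under-budget.
reads=500000000                              # e.g. 500 million reads total (made-up number)
tb=$(( (reads + 199999999) / 200000000 ))    # ceiling of reads / 200e6
echo "budget about ${tb} TB of scratch space"
```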
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt
which tells you that /mnt has about 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Find and load your 'pem' file; it's probably in your Downloads folder. Note: you have to select 'All files' on the bottom.
Load it.
Now, "save private key". Put it somewhere easy to find.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:
Logging into your EC2 instance with Putty
Open up putty, and enter your hostname into the Host Name box.
Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.
Log in as "ubuntu".
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)
OK, so you've created a running computer. How do you get to it?
The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.
Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.
First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.
Next, start Terminal (in Applications/Utilities) and type:
chmod og-rwx ~/Desktop/amazon.pem
to set the permissions on the private key file to "closed to all evildoers".
Then type:
ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
Here, you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.
Note: you have to replace the stuff after the '@' sign with the name of your own host.
At the end, you should see text and a prompt that look like this:
A final checklist:
• EC2 instance is running;
• used Ubuntu 14.04;
• NOT a micro instance (m3.xlarge or bigger).
3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install MobaXterm.
OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.
First, let's set up a place to work:
sudo chmod a+rwxt /mnt
This makes '/mnt' a place where we can put data and working files.
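If you're curious what 'a+rwxt' actually did, the sketch below applies the same change to a scratch directory (the directory name is invented for the demo) and inspects the resulting mode:

```shell
# Apply the same mode bits to a throwaway directory:
mkdir -p scratch
chmod a+rwxt scratch
ls -ld scratch   # mode reads drwxrwxrwt: everyone can read/write/enter, and the
                 # trailing 't' (sticky bit) means users can only delete their own files
```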
Next, let's install a few things:
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source
We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with
We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:
mkdir /mnt/data
Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:
ls -l
you should see something like:
-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:
chmod u-w *
We'll talk about what these files are below.
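To see what removing the owner's write bit does, you can try it on a throwaway file first (precious.txt is invented for the demo):

```shell
# Create a scratch file and write-protect it the same way:
echo data > precious.txt
chmod u-w precious.txt
ls -l precious.txt   # the mode now starts with -r--: the owner can no longer write to it
```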
4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a "virtual copy" of the data in your working directory by linking it in:
ln -fs /mnt/data/* .
These are FASTQ files - let's take a look at them:
less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:
• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?
Links:
• FASTQ Format
4.5 2. FastQC
We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.
Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now type 'ls':
ls -d *fastqc*
to list the files, and you should see:
0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:
• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
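Each FastQC .zip archive also bundles a plain-text summary.txt, with one PASS/WARN/FAIL line per analysis module, which is handy for a quick command-line check. A sketch on a fabricated summary file (the real one lives inside the unzipped *_fastqc directory):

```shell
# Fake a FastQC-style summary.txt for the demo (tab-separated: status, module, file):
mkdir -p demo_fastqc
printf 'PASS\tBasic Statistics\tdemo.fq\nWARN\tPer base sequence quality\tdemo.fq\n' > demo_fastqc/summary.txt

# How many modules raised a warning?
grep -c WARN demo_fastqc/summary.txt    # prints 1
```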
Questions:
• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?
Links:
• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.
The first thing we'll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
You should see output that looks like this:
Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
• Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(again, use spacebar to scroll, 'q' to exit less)
Questions:
• is the quality-trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files - and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line - the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention:
• # characters introduce comments;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
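The filename trick is plain bash pattern substitution, ${variable/pattern/replacement}, which you can try by itself (the sample value below is one of this data set's file name stems):

```shell
# bash-only (not POSIX sh): replace the first occurrence of _R1_ with _R2_
base="0Hour_ATCACG_L002_R1_001"
baseR2=${base/_R1_/_R2_}
echo $baseR2    # prints 0Hour_ATCACG_L002_R2_001
```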
Questions:
• how do you figure out if it's working?
– copy/paste it from Word;
– put in lots of echo statements;
– edit one line at a time;
• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done
gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data - known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):
filter-abund.py -V normC20k20.ct *.keep
See our paper, Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step:
0Hour_ATCACG_L002_001.dn.fq.gz   6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz   6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz   6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz   6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file, trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename, and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin
Install the BLAST server and configure it:
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing the zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:
cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
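Since the map is just two tab-separated columns (gene, then transcript), standard UNIX tools can summarize it - for example, counting isoforms per gene. A sketch on a fabricated three-line map (the real file is Trinity.fasta.gene_trans_map):

```shell
# Toy gene/transcript map: gene<TAB>transcript, Trinity-style identifiers
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > toy.gene_trans_map

# Isoform count per gene, most isoform-rich gene first
cut -f1 toy.gene_trans_map | sort | uniq -c | sort -rn
```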
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:
cd
sudo cpan URI::Escape
Note: type "yes" for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins.
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:
(a) search Trinity transcripts:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search TransDecoder-predicted proteins:
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database; rename the output files accordingly:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
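The '-outfmt 6' output is 12 tab-separated columns; the ones you'll use most are column 1 (query), column 2 (hit), and column 11 (e-value), so the files are easy to filter with awk. A sketch on a fabricated two-line file (the real files are blastx.outfmt6 and friends):

```shell
# Toy outfmt 6 file: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\tsp|P1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\nq2\tsp|P2\t80.0\t90\t9\t0\t1\t90\t1\t90\t0.5\t40\n' > toy.outfmt6

# Keep only hits with e-value <= 1e-5, and show query and hit names:
awk -F'\t' '$11 <= 1e-5 {print $1, $2}' toy.outfmt6    # prints: q1 sp|P1
```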
I have already run them overnight. You can download these files to save time:
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
• signal peptides, using signalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database
Install Trinotate:
cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI Perl package:
sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts);
2. Protein sequences (currently as defined by TransDecoder);
3. Gene/Transcript relationships.
~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report
~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting the best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let's look at the output. Open a new shell and run:
scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in
cd mntworkln -fs mntdatafastqgz
91 Download Express
Now get express
cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz
92 Align Reads with Bowtie
Next build an index file for your assembly
33
labibi Documentation Release 10
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
     -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) \
     -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) \
     > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
     -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) \
     -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) \
     > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
     -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) \
     -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) \
     > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
     -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) \
     -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) \
     > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:

csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

* False Discovery Rate
* Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns, including gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
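Filtering such a CSV from the command line is often handy. Here is a small, hypothetical stand-in for the edgeR output - the gene names, column order, and values below are made up for illustration, not taken from the real nema-edgeR.csv:

```shell
# Build a tiny stand-in for an edgeR results CSV (hypothetical genes/values),
# then keep only genes with FDR-adjusted P-value < 0.05.
printf 'gene,logFC,PValue,FDR\nnema.id1,2.5,0.0001,0.001\nnema.id2,0.1,0.6,0.8\n' > toy-edgeR.csv
awk -F, 'NR > 1 && $4 < 0.05 { print $1 }' toy-edgeR.csv
```

This prints nema.id1, the only gene passing the cutoff in the toy file.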
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
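The combination follows the standard ftp://username:password@hostname pattern; a quick sketch, using the example credentials above:

```shell
# Assemble an FTP URL of the form ftp://username:password@hostname/
# from the example credentials given above.
host=titan.bch.msu.edu
username=rnaseqmodel
password=QecheJa6
echo "ftp://$username:$password@$host/"
```

This prints the same combined URL you would paste into your browser or hand to wget.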
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz

Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
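As a minimal sketch of the "write a script" tips above - every filename here is a placeholder, not part of the tutorial data - a pipeline script might start like this:

```shell
#!/bin/bash
# Stop at the first failing command, and echo each command as it runs,
# so it's obvious where the pipeline breaks.
set -e
set -x

# Develop against a small subset first; this one synthetic FASTQ record
# stands in for a real subset file.
printf '@r1\nACGT\n+\nIIII\n' > subset.fq

# A real analysis step would go here; this placeholder just records how
# many FASTQ records the subset holds (4 lines per record).
echo "$(( $(wc -l < subset.fq) / 4 )) records" > counts.txt
```

Once this works on the subset, swap the placeholder step for your real commands and scale up.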
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?

Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

* close your terminal window;
* open up a new one and connect into your Amazon machine;
* type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
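The 4:1 ratio between lines and records is easy to check on a synthetic file (the read names below are made up):

```shell
# A FASTQ record is exactly 4 lines (header, sequence, '+', qualities),
# so dividing the line count by 4 gives the record count:
# 400000 lines -> 100000 records.
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 > tiny.fq
echo "$(( $(wc -l < tiny.fq) / 4 )) records"   # prints "3 records"
```

This is also why head takes a multiple of 4 lines: anything else would truncate a record mid-way.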
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.
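df -k reports sizes in 1K blocks; dividing by 1024 twice converts that to gigabytes, which is where the rough 40 GB figure comes from:

```shell
# Convert the 1K-blocks total from the df output above into GB.
# Integer arithmetic rounds down, so ~36 GB reads as "roughly 40 GB".
echo "$(( 38565344 / 1024 / 1024 )) GB"   # prints "36 GB"
```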
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Now that you've generated your PPK file from your PEM file, you can log in. To do that:

Logging into your EC2 instance with Putty

Open up putty, and enter your hostname into the Host Name box.

Now, go find the 'SSH' section and enter your ppk file (generated above by puttygen). Then select 'Open'.

Log in as "ubuntu".
Declare victory!

3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view and click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.

At the end, you should see text and a prompt that look like this:

A final checklist:

* EC2 instance is running;
* used ubuntu 14.04;
* NOT a micro instance (m3.xlarge or bigger).

3.1.4 Amazon Web Services reference material
Instance types
Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder: if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.

4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally, under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')

Questions:

* why do the files have DNA in the name?
* why are there R1 and R2 in the file names?
* why don't we combine all the files?

Links:

* FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:

fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

* 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

* What should you pay attention to in the FastQC report?
* Which is "better", R1 or R2? And why?

Links:

* FastQC
* FastQC tutorial video

4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

* Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

* Is the quality trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention –

* lines beginning with '#' are comments;
* anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
* what's with 'orphans.fq'??
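You can try the filename manipulations on their own, outside the loop, to see exactly what they do (using one of the sample filenames from above):

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# basename strips the directory part and the given suffix
base=$(basename $filename .extract.fastq.gz)
echo $base       # 0Hour_ATCACG_L002_R1_001

# ${var/pattern/replacement} substitutes the first match of the pattern
baseR2=${base/_R1_/_R2_}
echo $baseR2     # 0Hour_ATCACG_L002_R2_001
```

Running snippets like this interactively is a good way to debug a loop before letting it run over all your files.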
Questions:

* how do you figure out if it's working?
  - copy/paste it from Word?
  - put in lots of echo?
  - edit one line at a time?
* how on earth do you figure out how to do this?!
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice that this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/.}.pe.qc.fq.gz

     # run the interleaving
     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice
for filename in peqcfqgzkeepabundfiltpedo
base=$(basename $filename peqcfqgzkeepabundfiltpe)output=$basednfqgzgzip -c $filename gt $output
done
labibi Documentation, Release 1.0
Now, if you type:

   ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

   0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
   0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
   0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
   0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
   0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
   6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

   rm *.pe *.se *.abundfilt *.keep
   rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

   sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

   cd
   curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
   tar xzf trinity.tar.gz
   mv trinityrnaseq* trinity/

   cd trinity
   make

Go into the work directory, and prepare the data:

   cd /mnt/work
   for i in *.dn.fq.gz
   do
      split-paired-reads.py $i
   done

   cat *.1 > left.fq
   cat *.2 > right.fq

Now, run the Trinity assembler:

   ~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

   curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
   python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

   gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
   curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
   chmod u+x rename-with-partitions.py
   ./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

   sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

   sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

   sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

   cd
   git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
   sudo ./blastkit/configure-lighttpd.sh

   cd blastkit/www
   sudo ln -fs $PWD /var/www/blastkit

   mkdir files
   chmod a+rxwt files
   chmod +x /home/ubuntu

   cd /home/ubuntu/blastkit
   python ./check.py

Now, copy in your newly created transcriptome:

   cd /mnt/work
   gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

   cd /home/ubuntu/blastkit
   formatdb -i db/db.fa -o T -p F
   python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access in your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

   cd /mnt/work
   ~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

   less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.

• The different (g's) could represent different genes (or parts of genes).

• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.

8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

   cd
   sudo cpan URI::Escape

Note: type yes for all interactive questions.

   curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
   tar xzf transdecoder.tar.gz
   mv TransDecoder* TransDecoder
   cd TransDecoder
   make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
   cd /mnt/work
   ~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

   less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

   sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

   cd /mnt/work
   wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
   mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
   gunzip uniprot_sprot.trinotate.pep.gz
   makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

   blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

   blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

   wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
   mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
   gunzip uniprot_uniref90.trinotate.pep.gz
   makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

   blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
   blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

   wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastp.outfmt6
   wget https://github.com/ngs-docs/2015-may-nonmodel/raw/master/_static/uniref90.blastx.outfmt6
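The -outfmt 6 files produced above are plain tab-separated tables with 12 columns: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, and bit score. A quick way to skim the hits, shown on a single made-up line (the IDs here are invented for illustration):

```shell
# One fabricated outfmt 6 line, 12 tab-separated fields.
printf 'nema.id1.tr1\tsp|P12345|EXAMPLE\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t190\n' > example.outfmt6

# Pull out query, subject, percent identity, and E-value (fields 1, 2, 3, 11).
awk -F'\t' '{ print $1, $2, $3, $11 }' example.outfmt6
```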
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmscan to identify the protein domains:

   cd /mnt/work
   wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
   gunzip Pfam-A.hmm.gz
   hmmpress Pfam-A.hmm
   hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using signalP

• transmembrane regions, using tmHMM

• rRNA transcripts, using RNAMMER

8.5 Integration of all annotations into one database

Install Trinotate:

   cd
   curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
   tar xzf trinotate.tar.gz
   mv Trinotate* Trinotate

Install sqlite:

   sudo apt-get install sqlite3

We also need the DBI perl package:

   sudo cpan DBI
   sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database: a pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

   cd /mnt/work
   wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
   gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)

2. Protein sequences (currently as defined by TransDecoder)

3. Gene/Transcript relationships

   ~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Load BLAST homologies:

   ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
   ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
   ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
   ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load functional annotation features:

   ~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

   ~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)

2. 'DGC': domain gathering cutoff

3. 'DTC': domain trusted cutoff

4. 'SNC': sequence noise cutoff

5. 'SGC': sequence gathering cutoff

6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

   scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

   cd /mnt/data
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
   curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

   cd /mnt/work
   ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get eXpress:

   cd
   curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
   tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
   cd /mnt/work
   gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
   bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

   bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

   bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

   bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
   bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

   ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
   ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

   ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
   ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

   head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

   bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
   1  nema.id7.tr4    269  0.000000     0     0  0.000000     0.000000      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
   2  nema.id1.tr1    811  508.137307   1301  45  158.338092   252.711602    4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
   2  nema.id2.tr1    790  487.144836   1845  356 1218.927626  1976.727972   1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
   2  nema.id3.tr1    852  549.122606   1792  3   871.770849   1352.610064   5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
   2  nema.id4.tr1    675  372.190166   1005  20  88.963433    161.343106    2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
   3  nema.id62.tr13  2150 1846.657210  9921  9825 9919.902997 11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
   3  nema.id63.tr13  406  103.720396   360   270 271.097003   1061.173959   1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
   3  nema.id61.tr13  447  144.526787   6     0   0.000000     0.000000      2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
   4  nema.id21.tr8   2075 1771.684102  2782  58  958.636395   1122.756883   1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
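Since results.xprs is just a tab-separated table, you can also poke at it with standard UNIX tools. Here is a sketch on a made-up two-row file truncated to the first seven columns (the real file has fifteen):

```shell
# Build a miniature results.xprs-like table (tab-separated, with a header).
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > mini.xprs
printf '1\tnema.id7.tr4\t269\t0.000000\t0\t0\t0.000000\n' >> mini.xprs
printf '2\tnema.id1.tr1\t811\t508.137307\t1301\t45\t158.338092\n' >> mini.xprs

# Print target_id and est_counts (columns 2 and 7), skipping the header line.
awk -F'\t' 'NR > 1 { print $2, $7 }' mini.xprs
```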
9.4 Differential Expression

First, install R and edgeR:

   sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:
   csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
   csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
   csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
   csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate

• Learn R with Swirl

So, download the script:

   cd /mnt/work
   curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

   Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

   sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

   cd /mnt/work
   gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
   bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

   zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

   97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

   hostname: titan.bch.msu.edu
   username: rnaseqmodel
   password: QecheJa6

   URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
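The combined URI is just the three pieces glued together as ftp://username:password@hostname — sketched in shell, using the example credentials above:

```shell
# Assemble an FTP URL from its parts (example credentials from the e-mail).
username=rnaseqmodel
password=QecheJa6
hostname=titan.bch.msu.edu

url="ftp://${username}:${password}@${hostname}"
echo $url    # ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
```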
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

   mkdir ~/testdata
   cd ~/testdata

   wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

   wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
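For step 5, a minimal qsub (PBS) script might look like the sketch below. The resource numbers, module names, and my_pipeline.sh are placeholders — adapt them to your own cluster and pipeline:

```shell
# Write a skeleton PBS submission script; everything in it is illustrative.
cat > pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=04:00:00
#PBS -N small-test

# qsub starts jobs in $HOME; move to where the job was submitted from.
cd $PBS_O_WORKDIR

# module load trimmomatic fastqc    # whatever your HPC provides
bash my_pipeline.sh small_subset.fq.gz
EOF

grep -c '^#PBS' pipeline.qsub    # 2 directive lines
```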
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

   screen
   cd /mnt/work
   normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;

• open up a new one and connect into your Amazon machine;

• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

   gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
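The arithmetic works because every FASTQ record is exactly four lines (header, sequence, '+', qualities), so N lines is N/4 records. A self-contained check on a tiny made-up file:

```shell
# Build a toy gzipped FASTQ with 3 records (12 lines), then take the first
# 8 lines -- i.e. the first 2 records -- the same way as above.
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 | gzip > toy.fq.gz
gunzip -c toy.fq.gz | head -8 | gzip > toy.sample.fq.gz

gunzip -c toy.sample.fq.gz | wc -l    # 8 lines = 2 FASTQ records
```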
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

   df -k /mnt

You should see something like this:

   Filesystem     1K-blocks     Used Available Use% Mounted on
   /dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

   http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Declare victory
3.1.3 Logging into your new instance "in the cloud" (Mac version)

OK, so you've created a running computer. How do you get to it?

The main thing you'll need is the network name of your new computer. To retrieve this, go to the instance view, click on the instance, and find the "Public DNS". This is the public name of your computer on the Internet.

Copy this name, and connect to that computer with ssh under the username 'ubuntu', as follows.

First, find your private key file; it's the .pem file you downloaded when starting up your EC2 instance. It should be in your Downloads folder. Move it onto your desktop and rename it to 'amazon.pem'.

Next, start Terminal (in Applications... Utilities...) and type:

   chmod og-rwx ~/Desktop/amazon.pem

to set the permissions on the private key file to "closed to all evildoers".

Then type:

   ssh -i ~/Desktop/amazon.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

Here you're logging in as user 'ubuntu' to the machine 'ec2-174-129-122-189.compute-1.amazonaws.com' using the authentication key located in 'amazon.pem' on your Desktop.

Note: you have to replace the stuff after the '@' sign with the name of the host; see the red circle in the figure.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running;

• using ubuntu 14.04;

• NOT a micro instance (m3.xlarge, or bigger).

3.1.4 Amazon Web Services reference material

Instance types

Instance costs
CHAPTER 4
Short read quality and trimming
Note Reminder if yoursquore on Windows you should install mobaxterm
OK you should now be logged into your Amazon computer How exciting
41 Prepping the computer
Before we do anything else we need to set up a place to work and install a few things
First letrsquos set up a place to work
sudo chmod a+rwxt mnt
This makes lsquomntrsquo a place where we can put data and working files
Next letrsquos install a few things
sudo apt-get updatesudo apt-get install -y trimmomatic fastqc python-pip python-dev
These are the Trimmomatic and FastQC programs which wersquoll use below along with some software prerequisites thatwersquoll need for other things below
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable; by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in:

ln -fs /mnt/data/* .

These are FASTQ files; let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

* Why do the files have DNA in the name?
* Why are there R1 and R2 in the file names?
* Why don't we combine all the files?

Links:

* FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer; that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

* 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
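If you want to try the copy today, scp can pull a report down to your laptop. A sketch, run from your LOCAL machine (not the Amazon one); the hostname placeholder must be replaced with your instance's Public DNS, and the path assumes you ran FastQC in /mnt/work as above:

```shell
# run on your LOCAL machine; replace the ec2-... hostname with your own
scp -i ~/Desktop/amazon.pem \
    ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/work/0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html \
    .
```

Then open fastqc_report.html in your Web browser.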
Questions:

* What should you pay attention to in the FastQC report?
* Which is "better", R1 or R2? And why?

Links:

* FastQC
* FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
    0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se 0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?
* Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

* Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

* 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
* 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

* Is the quality trimmed data "better" than before?
* Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files, and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line: the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
    # first, make the base by removing .extract.fastq.gz
    base=$(basename $filename .extract.fastq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # finally, run Trimmomatic
    TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

    # save the orphans
    cat s1_se s2_se >> orphans.fq
done
Things to mention:

* '#' starts a comment.
* Anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop.
* We have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about.
* What's with 'orphans.fq'?
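The ${base/_R1_/_R2_} construct is bash's pattern substitution: it expands to the value of $base with the first match of _R1_ replaced by _R2_. You can try it on its own at the prompt:

```shell
base=0Hour_ATCACG_L002_R1_001
# replace the first occurrence of _R1_ with _R2_
echo ${base/_R1_/_R2_}
# prints: 0Hour_ATCACG_L002_R2_001
```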
Questions:

* How do you figure out if it's working?

  - copy/paste it from Word
  - put in lots of echo
  - edit one line at a time

* How on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
    # first, make the base by removing .qc.fq.gz
    base=$(basename $filename .qc.fq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # construct the output filename
    output=${base/_R1_/}.pe.qc.fq.gz

    interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data. Known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements):

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
    extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
    base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
    output=${base}.dn.fq.gz
    gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step:

0Hour_ATCACG_L002_001.dn.fq.gz    6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz    6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz    6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz    6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type "yes" for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

* signal peptides, using SignalP
* transmembrane regions, using tmHMM
* rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database: a pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one: diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

* False Discovery Rate
* Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
113 Developing your own pipeline
Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
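The qsub script mentioned above can be as simple as your working shell script with scheduler directives at the top. A minimal sketch follows; the walltime, memory, module names, and the `pipeline.sh` filename are all placeholders for illustration, and your cluster's will differ:

```shell
# Write a skeleton PBS/qsub submission script around a working shell script.
# Everything here (walltime, memory, module names, pipeline.sh) is illustrative.
cat > run-pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l mem=16gb
#PBS -l nodes=1:ppn=4

# load the tools the pipeline needs (module names vary by cluster)
module load Trimmomatic FastQC

# qsub starts jobs in your home directory; cd to where the data lives
cd ${PBS_O_WORKDIR}
bash pipeline.sh
EOF

chmod +x run-pipeline.qsub
head -1 run-pipeline.qsub
```

You would then submit it with 'qsub run-pipeline.qsub' and check on it with 'qstat'.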
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
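If you do decide to trim at a fixed position, Trimmomatic's CROP option does this (e.g. CROP:70 keeps only the first 70 bases of every read). Purely as an illustration of the effect, here is a tiny awk sketch that crops a synthetic one-record FASTQ to 5 bases:

```shell
# one 10-base FASTQ record (4 lines: @ header, sequence, +, quality)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > demo.fq

# in a well-formed 4-line FASTQ, sequence and quality fall on even line
# numbers, so cropping both to 5 characters mimics Trimmomatic's CROP:5
awk 'NR % 2 == 0 { print substr($0, 1, 5); next } { print }' demo.fq > cropped.fq
cat cropped.fq
```

The header and '+' lines pass through untouched; only the sequence and quality lines are shortened, which is what keeps the record well-formed.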
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
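For paired-end files you'll want to subset R1 and R2 with the same number of lines so the pairs stay in sync. A self-contained sketch on tiny synthetic files:

```shell
# build two tiny gzipped FASTQ files (2 records each, 4 lines per record)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGGG\n+\nIIII\n' | gzip > R1.fq.gz
printf '@r1/2\nTTTT\n+\nIIII\n@r2/2\nCCCC\n+\nIIII\n' | gzip > R2.fq.gz

# take the first record (4 lines) from each file -- using the same
# head count on both keeps the pairs matched up
gunzip -c R1.fq.gz | head -4 | gzip > sample_R1.fq.gz
gunzip -c R2.fq.gz | head -4 | gzip > sample_R2.fq.gz

# both subsets contain the same number of records
echo $(( $(gunzip -c sample_R1.fq.gz | wc -l) / 4 ))   # -> 1
```

Because FASTQ files from the same run list reads in the same order, taking the same number of lines from the top of each file preserves the pairing.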
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really, any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
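Since df -k reports sizes in 1K blocks, a little arithmetic converts the "Available" column to whole GB. A sketch against / (which always exists; substitute /mnt on your instance), using the portable -P flag so long device names don't wrap onto a second line:

```shell
# df -Pk prints sizes in 1K blocks (-P avoids wrapped lines);
# column 4 of the data line is the available space
avail_kb=$(df -Pk / | awk 'NR == 2 { print $4 }')
echo "$(( avail_kb / 1024 / 1024 )) GB available"
```

Dividing by 1024 twice takes you from KB to MB to GB; compare the result against the rule of thumb above before starting a big run.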
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
At the end, you should see text and a prompt that look like this:

A final checklist:

• EC2 instance is running
• used ubuntu 14.04
• NOT a micro instance (m3.xlarge or bigger)

3.1.4 Amazon Web Services reference material

Instance types

Instance costs
CHAPTER 4
Short read quality and trimming
Note: Reminder - if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer
Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.
4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R2_001.extract.fastq.gz
These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *
We'll talk about what these files are below.

4.4 1. Copying data into a working location
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .
These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?

Links:

• FASTQ Format
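As a reminder of what 'less' will show you: each FASTQ record is exactly four lines - an @ header, the sequence, a + separator, and per-base quality scores. A synthetic example (the header text here is made up):

```shell
# build a one-record gzipped FASTQ and display its four lines
printf '@example_read_1\nACGTACGT\n+\nIIIIIIII\n' | gzip > example.fq.gz
gunzip -c example.fq.gz | head -4
```

The quality line is always the same length as the sequence line; that one-to-one correspondence is what trimming tools rely on.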
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links
bull Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)
Questions:

• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing _R1_ with _R2_
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention –

• # characters introduce comments;
• anywhere you see a '$' is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
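The filename trick is ordinary bash parameter substitution, ${variable/pattern/replacement}, which replaces the first match of the pattern in the variable's value. You can try it directly at the prompt:

```shell
# ${base/_R1_/_R2_} swaps the first _R1_ for _R2_ in the variable's value
base=0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}   # -> 0Hour_ATCACG_L002_R2_001

# basename strips any leading directory and a trailing suffix
basename /mnt/work/0Hour_ATCACG_L002_R1_001.extract.fastq.gz .extract.fastq.gz
```

This is why the loop only has to match the R1 files: the matching R2 filename is computed, never globbed.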
Questions:

• how do you figure out if it's working?

– copy/paste it from Word
– put in lots of echo
– edit one line at a time

• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing _R1_ with _R2_
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
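A quick way to sanity-check what diginorm kept is to count records in any of these files; FASTQ is four lines per record, so total lines divided by four gives the read count (shown here on a synthetic file):

```shell
# count reads in a gzipped FASTQ: total lines divided by 4
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' \
    | gzip > demo.dn.fq.gz
echo $(( $(gunzip -c demo.dn.fq.gz | wc -l) / 4 ))   # -> 3
```

Running the same one-liner on a file before and after normalization shows you directly how many reads were discarded.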
Let's remove some of the detritus before moving on:

rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
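assemstats3.py reports length statistics above a cutoff; if you just want a quick contig count and total bases at a minimum length, a small awk sketch over any FASTA does something similar (shown on a toy file, with a 5 bp cutoff):

```shell
# two toy contigs: one 10 bp, one 3 bp
printf '>contig1\nACGTACGTAC\n>contig2\nACG\n' > toy.fa

# count sequences of at least 5 bp and sum their lengths
awk '/^>/ { if (len >= 5) { n++; total += len } len = 0; next }
           { len += length($0) }
     END   { if (len >= 5) { n++; total += len }
             print n " contigs, " total " bp" }' toy.fa > toy.stats
cat toy.stats   # -> 1 contigs, 10 bp
```

Accumulating the length across lines (rather than assuming one line per sequence) makes this work on wrapped FASTA output too.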
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c...) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
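Since the map is a two-column, tab-separated file (gene, then transcript), standard UNIX tools can summarize it; for example, counting isoforms per gene, shown here on a synthetic map with made-up identifiers:

```shell
# synthetic gene/transcript map: the first gene has two isoforms, the second one
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' \
    > toy.gene_trans_map

# column 1 is the gene; count how many transcripts map to each
cut -f1 toy.gene_trans_map | sort | uniq -c
```

Genes with unusually many isoforms are worth a second look - as noted above, the computer can't distinguish real splice variants from noise or paralogs.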
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
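The -outfmt 6 files are plain 12-column tab-separated tables (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). Counting how many distinct transcripts got at least one hit is then a one-liner, shown here on a synthetic table with made-up identifiers:

```shell
# three hits spanning two transcripts, in BLAST tabular (outfmt 6) layout
printf 'transcript1\tsp|P1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200\n' > toy.outfmt6
printf 'transcript1\tsp|P2\t90.0\t100\t9\t0\t1\t100\t1\t100\t1e-20\t120\n' >> toy.outfmt6
printf 'transcript2\tsp|P3\t88.0\t80\t9\t0\t1\t80\t1\t80\t1e-10\t90\n'    >> toy.outfmt6

# distinct query ids (column 1) with at least one hit
cut -f1 toy.outfmt6 | sort -u | wc -l
```

Dividing that number by the total transcript count gives a rough sense of how well-annotated your assembly will be.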
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

We have run these overnight already; you can download the result files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

• signal peptides, using signalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database: a pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts);
2. Protein sequences (currently as defined by TransDecoder);
3. Gene/Transcript relationships.

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
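Despite the .xls extension, the report is a plain tab-separated table, so you can also inspect it from the command line; a sketch on a synthetic stand-in (the column names and identifiers here are made up - the real report has more columns):

```shell
# a two-line stand-in for a tab-separated annotation report
printf 'gene_id\ttranscript_id\ttop_blastx_hit\nTR1|c0_g1\tTR1|c0_g1_i1\tP53_HUMAN\n' \
    > toy_report.xls

# pull out the first two columns with cut
cut -f1,2 toy_report.xls
```

The same cut/sort/uniq tricks used on the BLAST tables earlier work here too, e.g. for tallying how many transcripts picked up an annotation.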
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
labibi Documentation, Release 1.0
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
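Before quantifying, it can be reassuring to check that the SAM files actually contain alignments. Here is a sketch that counts mapped records (those with FLAG bit 0x4 unset); the tiny demo SAM written first stands in for real bowtie output, so the snippet can be tried anywhere:

```shell
# Write a two-record demo SAM: r1 is mapped (flag 0), r2 is unmapped (flag 4).
sam=demo.extract.sam
printf '@HD\tVN:1.0\n' > "$sam"
printf 'r1\t0\ttx1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

# Skip header lines (@...) and keep records whose FLAG has bit 0x4 clear.
mapped=$(awk '!/^@/ && int($2 / 4) % 2 == 0' "$sam" | wc -l)
echo "mapped records: $mapped"
```

Run against one of the real `*.extract.sam` files, a count of zero would mean the index or the read files are wrong.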
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
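For a quick look at the most highly expressed transcripts, you can sort results.xprs by its tpm column (column 15). The miniature two-transcript file written below stands in for real eXpress output so the sorting step can be tried on its own:

```shell
# Build a tiny stand-in results.xprs with the same 15-column header.
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\teff_counts\tambig_distr_alpha\tambig_distr_beta\tfpkm\tfpkm_conf_low\tfpkm_conf_high\tsolvable\ttpm\n' > results.xprs
printf '1\ttxA\t500\t400.0\t10\t5\t8.0\t9.0\t0\t0\t1.0e+02\t90\t110\tT\t2.0e+02\n' >> results.xprs
printf '2\ttxB\t800\t700.0\t50\t40\t45.0\t48.0\t0\t0\t9.0e+02\t800\t1000\tT\t1.8e+03\n' >> results.xprs

# Drop the header, sort by tpm (general-numeric handles the e+03 notation),
# and show target_id plus tpm for the top entries.
tail -n +2 results.xprs | sort -k15,15gr | cut -f2,15 | head -5
top=$(tail -n +2 results.xprs | sort -k15,15gr | head -1 | cut -f2)
echo "highest tpm: $top"
```

The same pipeline on a real results.xprs gives a fast "does this look sane?" check before any downstream analysis.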
9.4 Differential Expression

First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
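A downstream script like this one will typically line the four counts files up against each other, so a quick check that they all list transcripts in the same order can save debugging later. The two demo files here stand in for the real *_counts.txt:

```shell
# Stand-in counts files: transcript name, tab, estimated count.
printf 'txA\t10\ntxB\t20\n' > 0Hour_repl1_counts.txt
printf 'txA\t12\ntxB\t18\n' > 0Hour_repl2_counts.txt

# Compare just the name columns; cmp -s is silent and sets the exit status.
cut -f1 0Hour_repl1_counts.txt > .order1
cut -f1 0Hour_repl2_counts.txt > .order2
if cmp -s .order1 .order2
then
    status=OK
else
    status=MISMATCH
fi
echo "row order: $status"
```

Repeat the comparison for each pair of files; a MISMATCH means the counts cannot be combined positionally.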
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links
bull False Discovery Rate
bull Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
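If you want the rate for every sample, you would loop over the files and capture bowtie2's stderr (it writes its summary there, e.g. with 2>&1). The parsing step looks like this; the canned summary text below is a stand-in for real bowtie2 output so the extraction can be tried on its own:

```shell
# Stand-in for the stderr summary that bowtie2 prints after mapping.
summary='400000 reads; of these:
  400000 (100.00%) were unpaired; of these:
97.18% overall alignment rate'

# Pull out just the percentage from the "overall alignment rate" line.
rate=$(printf '%s\n' "$summary" | awk '/overall alignment rate/ {print $1}')
echo "rate: $rate"
```

Collected across samples, a sharp drop in this number for one library is a quick flag that something went wrong with that library's prep or trimming.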
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
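A minimal skeleton for the kind of script described above might look like the following; every file and directory name in it is a placeholder:

```shell
#!/bin/bash
# Skeleton pipeline script: stop on the first error, echo each command as it
# runs, and develop against a tiny data subset before scaling up.
set -e
set -x

DATA=demo_data
WORK=demo_work
mkdir -p "$DATA" "$WORK"

# step 1: make a small subset to develop against
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > "$DATA/sample_R1.fastq"
head -4 "$DATA/sample_R1.fastq" > "$WORK/subset_R1.fastq"

# steps 2..N: the real trimming/assembly/quantification commands go here,
# added one at a time once each is known to work interactively

echo "pipeline finished"
```

The `set -e` / `set -x` pair is what makes a half-broken pipeline announce itself instead of silently producing empty files.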
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
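You can sanity-check a subset by confirming its line count is a multiple of 4 (FASTQ records are exactly four lines each). A four-record demo file stands in for sample.fq.gz here:

```shell
# Write four tiny FASTQ records and gzip them, as a stand-in subset.
printf '@r%s\nACGT\n+\nIIII\n' 1 2 3 4 | gzip > sample.fq.gz

# Records = lines / 4; any leftover lines mean a record was cut in half.
lines=$(gunzip -c sample.fq.gz | wc -l)
echo "records: $((lines / 4)), leftover lines: $((lines % 4))"
```

Taking a multiple of 4 with head (as above, 400000) guarantees a leftover of zero; this check matters more if you subset by byte count or with other tools.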
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really, any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
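That rule of thumb is easy to turn into arithmetic; the read count below is just an example:

```shell
# ~1 TB of intermediate space per 200 million reads, rounded up to whole TB.
reads=500000000                               # e.g. half a billion reads total
tb=$(( (reads + 199999999) / 200000000 ))     # ceiling division
echo "budget at least ${tb} TB of scratch space"
```

Running this with your own total read count before launching an instance avoids the common failure mode of a pipeline dying halfway through on a full disk.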
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 4
Short read quality and trimming
Note: Reminder - if you're on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!
4.1 Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let's set up a place to work:

sudo chmod a+rwxt /mnt

This makes '/mnt' a place where we can put data and working files.

Next, let's install a few things:

sudo apt-get update
sudo apt-get install -y trimmomatic fastqc python-pip python-dev

These are the Trimmomatic and FastQC programs, which we'll use below, along with some software prerequisites that we'll need for other things below.
4.2 Data source

We're going to be using a subset of data from Tulin et al., 2013, a paper looking at early transcription in the organism Nematostella vectensis, the sea anemone.

4.3 1. Copying in some data to work with

We've loaded subsets of the data onto an Amazon location for you, to make everything faster for today's work. We're going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let's grab part of the data set:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14 2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14 2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
(use the spacebar to scroll down, and type 'q' to exit 'less')

Questions:

• why do the files have DNA in the name?
• why are there R1 and R2 in the file names?
• why don't we combine all the files?

Links:

• FASTQ Format
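A FASTQ record is always four lines - name, sequence, a '+' separator, and per-base quality. Here's a sketch that labels the first record of a gzipped file; the demo file stands in for the real data so it can be tried anywhere:

```shell
# Write one demo FASTQ record and gzip it.
printf '@read1/1\nACGTACGT\n+\nIIIIIIII\n' | gzip > demo.fastq.gz

# Label the four lines of the first record.
gunzip -c demo.fastq.gz | head -4 | awk '
    NR == 1 { print "name:    " $0 }
    NR == 2 { print "seq:     " $0 }
    NR == 3 { print "plus:    " $0 }
    NR == 4 { print "quality: " $0 }'
name=$(gunzip -c demo.fastq.gz | head -1)
echo "first read is $name"
```

Point the same pipeline at 0Hour_ATCACG_L002_R1_001.extract.fastq.gz to see a real record; the /1 vs /2 suffix convention in the names is how R1 and R2 mates are distinguished.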
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But, we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now, view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

• is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention –

• # characters introduce comments;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
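The ${base/_R1_/_R2_} construct is bash's pattern substitution, and you can try it on its own, outside the loop:

```shell
# bash pattern substitution: ${variable/pattern/replacement}
base=0Hour_ATCACG_L002_R1_001
baseR2=${base/_R1_/_R2_}     # swap the first _R1_ for _R2_
echo "$baseR2"
```

This is how each R1 file gets paired with its R2 mate without typing both names; note it is a bash feature, so it won't work under plain sh.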
Questions:

• how do you figure out if it's working?

  – copy/paste it from Word?
  – put in lots of echo?
  – edit one line at a time?

• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se.* orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe.* *.se.* *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
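To confirm the renaming worked, you can peek at the first few sequence names. The demo FASTA below (with made-up headers) stands in for the real file, since the exact renamed headers depend on the script:

```shell
# Build a two-sequence stand-in for trinity-nematostella-raw.renamed.fasta.gz.
printf '>nemaid1 len=100\nACGT\n>nemaid2 len=200\nTTTT\n' | gzip > demo.renamed.fasta.gz

# Show the header lines and count the sequences.
gunzip -c demo.renamed.fasta.gz | grep '^>' | head -5
n=$(gunzip -c demo.renamed.fasta.gz | grep -c '^>')
echo "sequences: $n"
```

The `grep -c '^>'` count run on both the raw and renamed files should match exactly; a difference means something went wrong during renaming.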
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes and isoforms:

- The different i's that correspond to the same g represent isoforms.
- The different g's could represent different genes (or parts of genes).
- The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
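Since the map is a plain two-column (gene, tab, transcript) file, standard UNIX tools can summarize it. A sketch on a made-up three-line map; on real data, substitute Trinity.fasta.gene_trans_map for map.tsv:

```shell
# Two isoforms (i1, i2) of one gene plus one single-isoform gene;
# the IDs are invented for illustration.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > map.tsv

echo "genes:       $(cut -f1 map.tsv | sort -u | wc -l)"   # unique column 1
echo "transcripts: $(cut -f2 map.tsv | sort -u | wc -l)"   # unique column 2
```

Comparing the two counts gives you a rough feel for how many isoforms per gene Trinity reported.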
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: UniRef90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
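The -outfmt 6 files are plain 12-column tab-separated tables (query, subject, percent identity, ..., e-value, bit score), so you can summarize them with cut and sort. A sketch on a two-line made-up file; run the same pipeline on blastx.outfmt6 etc. for real output:

```shell
# Two hypothetical hits for the same query, q1.
printf 'q1\tsp|P11\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\nq1\tsp|P22\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t120\n' > demo.outfmt6

# How many distinct queries got at least one hit?
cut -f1 demo.outfmt6 | sort -u | wc -l    # -> 1
```

On the real blastx output, this number tells you how many transcripts found a homolog in the database.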
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer

Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

- signal peptides, using signalP
- transmembrane regions, using tmHMM
- rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: Loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

...and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4    269   0.000000     0     0     0.000000     0.000000      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1    811   508.137307   1301  45    158.338092   252.711602    4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1    790   487.144836   1845  356   1218.927626  1976.727972   1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1    852   549.122606   1792  3     871.770849   1352.610064   5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1    675   372.190166   1005  20    88.963433    161.343106    2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406   103.720396   360   270   271.097003   1061.173959   1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447   144.526787   6     0     0.000000     0.000000      2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8   2075  1771.684102  2782  58    958.636395   1122.756883   1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
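Because results.xprs is tab-separated, you can rank transcripts by expression with sort. A sketch using a cut-down three-column stand-in (the real file has 15 columns, with tpm last, so adjust the -k field number accordingly):

```shell
# Miniature stand-in for results.xprs: bundle id, target id, tpm.
printf 'bundle_id\ttarget_id\ttpm\n1\ttrA\t10.5\n1\ttrB\t99.9\n1\ttrC\t0.2\n' > demo.xprs

# Skip the header, then sort general-numeric, descending, on the tpm column.
tail -n +2 demo.xprs | sort -k3,3gr | head -1    # highest-tpm transcript
```

The -g flag makes sort handle the scientific-notation values that eXpress writes.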
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

- False Discovery Rate
- Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R
This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
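bowtie2 prints that rate on stderr, so it can be scraped and tabulated when you repeat the check across samples. A sketch that parses a saved log line (the bt2.log file name is made up; the value is the one shown above):

```shell
# Pretend we captured bowtie2's stderr into a log file.
echo '97.18% overall alignment rate' > bt2.log

# Pull out just the percentage.
grep 'overall alignment rate' bt2.log | cut -d'%' -f1    # -> 97.18
```

Redirecting each sample's bowtie2 stderr (2> sample.log) and running this grep over the logs gives a quick per-sample table of mapping rates.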
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
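To convince yourself of the 4-lines-per-record arithmetic, count the records in a subset. A sketch with a tiny synthetic file (swap in sample.fq.gz to check your real subset):

```shell
# Two FASTQ records (4 lines each), gzipped like a real subset.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | gzip > demo.fq.gz

# Lines divided by 4 = number of reads.
echo $(( $(gunzip -c demo.fq.gz | wc -l) / 4 ))    # -> 2
```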
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
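The rule of thumb in item 3 turns into a quick back-of-envelope calculation; the read count below is purely illustrative:

```shell
# ~1 TB of intermediate disk per 200 million reads (rule of thumb above).
READS_IN_MILLIONS=600
echo "$(( READS_IN_MILLIONS / 200 )) TB of scratch space needed"    # -> 3 TB
```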
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_001.extract.fastq.gz
Now, if you type:

ls -l

you should see something like:

-r--r--r-- 1 ubuntu ubuntu 7874107 Dec 14  2013 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
-r--r--r-- 1 ubuntu ubuntu 7972058 Dec 14  2013 0Hour_ATCACG_L002_R1_002.extract.fastq.gz

These are subsets of the original data, where we selected for reads that belong to a few particular transcripts.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let's fix that before we go on any further:

chmod u-w *

We'll talk about what these files are below.
4.4 1. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a "virtual copy" of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let's take a look at them:

less 0Hour_ATCACG_L002_R1_001.extract.fastq.gz

(use the spacebar to scroll down, and type 'q' to exit 'less')
Questions:

- why do the files have DNA in the name?
- why are there R1 and R2 in the file names?
- why don't we combine all the files?

Links:

- FASTQ Format
4.5 2. FastQC

We're going to use FastQC to summarize the data. We already installed 'fastqc' on our computer - that's what the 'apt-get install' did, above.

Now, run FastQC on two files:
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

- 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html

Questions:

- What should you pay attention to in the FastQC report?
- Which is "better", R1 or R2? And why?

Links:

- FastQC
- FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
- Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

- Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

- 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
- 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less)

Questions:

- Is the quality-trimmed data "better" than before?
- Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
    # first, make the base by removing .extract.fastq.gz
    base=$(basename $filename .extract.fastq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # finally, run Trimmomatic
    TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

    # save the orphans
    cat s1_se s2_se >> orphans.fq
done
Things to mention –

• # characters begin comments;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'??
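The two filename tricks can be tried out on their own, with a single hypothetical filename, before running the whole loop (this is plain bash; no data files needed):

# a hypothetical filename matching the *_R1_*.extract.fastq.gz pattern
filename="0Hour_ATCACG_L002_R1_001.extract.fastq.gz"

# basename strips any directory prefix plus the given suffix
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# ${variable/pattern/replacement} swaps the first _R1_ for _R2_
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001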
Questions:

• How do you figure out if it's working?
  – copy/paste it from Word
  – put in lots of echo
  – edit one line at a time
• How on earth do you figure out how to do this?!
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
    # first, make the base by removing .qc.fq.gz
    base=$(basename $filename .qc.fq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # construct the output filename
    output=${base/_R1_/.}.pe.qc.fq.gz

    interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
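If you want to see what "interleaved" means without touching the real data, here is a toy sketch (the read names and sequences are made up) that alternates one four-line R1 record with its matching R2 record, which is essentially what interleave-reads.py produces:

# two toy FASTQ files, one 4-line record each
printf '@read1/1\nACGT\n+\nIIII\n' > toy_r1.fq
printf '@read1/2\nTGCA\n+\nJJJJ\n' > toy_r2.fq

# fold each 4-line record onto one tab-joined line, alternate the two
# streams with paste, then unfold the tabs back into newlines
paste -d'\n' \
    <(awk 'ORS = NR % 4 ? "\t" : "\n"' toy_r1.fq) \
    <(awk 'ORS = NR % 4 ? "\t" : "\n"' toy_r2.fq) | tr '\t' '\n'

The output is the R1 record followed immediately by its R2 partner - eight lines for one read pair.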
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
    extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
    base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
    output=${base}.dn.fq.gz
    gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002.001.dn.fq.gz   6Hour_CGATGT_L002.002.dn.fq.gz
0Hour_ATCACG_L002.002.dn.fq.gz   6Hour_CGATGT_L002.003.dn.fq.gz
0Hour_ATCACG_L002.003.dn.fq.gz   6Hour_CGATGT_L002.004.dn.fq.gz
0Hour_ATCACG_L002.004.dn.fq.gz   6Hour_CGATGT_L002.005.dn.fq.gz
0Hour_ATCACG_L002.005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002.001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
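The map itself is just two tab-separated columns - gene on the left, transcript on the right. A toy version (the IDs below are invented, in the TR|c naming style) shows how you could count isoforms per gene:

# hypothetical gene<TAB>transcript pairs in the same shape as the real map
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR1|c0_g2\tTR1|c0_g2_i1\n' > toy.gene_trans_map

# tally the number of transcripts (isoforms) per gene
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' toy.gene_trans_map | sort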
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: UniRef90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

...and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) \
    > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) \
    > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) \
    > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
    -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) \
    -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) \
    > 6Hour_CGATGT_L002_002.extract.sam
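The <(zcat ...) bits are bash process substitution: the command inside runs, and its output is handed to bowtie as if it were a regular file, so the gzipped reads never need to be decompressed to disk. A minimal illustration with a throwaway file:

# make a small gzipped file
echo hello | gzip > toy.gz

# <(zcat toy.gz) looks like a filename to cat, but reads zcat's output
cat <(zcat toy.gz)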
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id       length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nema.id7.tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nema.id1.tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nema.id2.tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nema.id3.tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nema.id4.tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nema.id62.tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nema.id63.tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nema.id61.tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nema.id21.tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
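If csvtool isn't available, the same extraction can be sketched with awk; in the results.xprs layout shown earlier, target_id is column 2 and est_counts is column 7 (the toy table and values below are made up):

# a toy results.xprs-style header plus one data row
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > toy.xprs
printf '1\tgeneA\t500\t420.0\t100\t90\t95.5\n' >> toy.xprs

# keep target_id (col 2) and est_counts (col 7), skipping the header line
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' toy.xprs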
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
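Why 400,000 lines equals 100,000 records: each FASTQ record is exactly four lines (header, sequence, '+', quality), so any 'head -N' with N a multiple of 4 keeps whole records. A minimal sketch of this, using a made-up three-record file (tiny.fq and sub.fq are hypothetical names, not workshop data):

```shell
# build a tiny synthetic FASTQ with three records (4 lines each)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > tiny.fq

# take the first 8 lines = exactly 2 whole records
head -8 tiny.fq > sub.fq

# count records; only safe here because our quality lines never start with '@'
grep -c '^@r' sub.fq
```

In real data, quality lines can start with '@', so count lines and divide by 4 rather than grepping for headers.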
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really, any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
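The rule of thumb in point 3 is easy to turn into shell arithmetic; a sketch with a hypothetical 600-million-read data set (the read count is illustrative):

```shell
# ~1 TB of intermediate disk per 200 million reads (rule of thumb above)
reads_million=600
echo "$((reads_million / 200)) TB"
```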
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 38 GB of disk space.
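'df -k' reports sizes in 1K blocks, so converting to (decimal) gigabytes is just division; a quick sketch using the total from the output above:

```shell
# total size of /mnt from the df output, in 1K blocks
kblocks=38565344

# divide twice by 1000 for decimal gigabytes
echo "$((kblocks / 1000 / 1000)) GB"
```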
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
fastqc 0Hour_ATCACG_L002_R1_001.extract.fastq.gz
fastqc 0Hour_ATCACG_L002_R2_001.extract.fastq.gz

Now type 'ls':

ls -d *fastqc*

to list the files, and you should see:

0Hour_ATCACG_L002_R1_001.extract_fastqc
0Hour_ATCACG_L002_R1_001.extract_fastqc.zip
0Hour_ATCACG_L002_R2_001.extract_fastqc
0Hour_ATCACG_L002_R2_001.extract_fastqc.zip

We are not going to show you how to look at these files right now - you need to copy them to your local computer to do that. We'll show you that tomorrow. But we can show you what they look like, because I've made copies of them for you:

• 0Hour_ATCACG_L002_R1_001.extract_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.extract_fastqc/fastqc_report.html
Questions:

• What should you pay attention to in the FastQC report?
• Which is "better", R1 or R2? And why?

Links:

• FastQC
• FastQC tutorial video
4.6 3. Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via apt-get.

The first thing we'll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE 0Hour_ATCACG_L002_R1_001.extract.fastq.gz \
              0Hour_ATCACG_L002_R2_001.extract.fastq.gz \
     0Hour_ATCACG_L002_R1_001.qc.fq.gz s1_se \
     0Hour_ATCACG_L002_R2_001.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 140557 Both Surviving: 138775 (98.73%) Forward Only Surviving: 1776 (1.26%) Reverse Only Surviving: 6 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:

• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.

Links:

• Trimmomatic
4.7 4. FastQC again

Run FastQC again on the trimmed files:

fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz

And now view my copies of these files:

• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html

Let's take a look at the output files:

less 0Hour_ATCACG_L002_R1_001.qc.fq.gz

(again, use spacebar to scroll, 'q' to exit less.)

Questions:

• Is the quality trimmed data "better" than before?
• Does it matter that you still have adapters?

4.8 5. Trim the rest of the sequences

First, download the rest of the data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

Now we have a lot of files – and we really don't want to trim each and every one of them by typing in a command for each pair! Here we'll make use of a great feature of the UNIX command line – the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:

rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done

Things to mention –

• '#' lines are comments;
• anywhere you see a '$' is replaced by the value of the variable after it, so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
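The filename manipulation relies on two pieces of bash plumbing: 'basename' strips a suffix, and ${var/pattern/replacement} substitutes inside a variable. A standalone sketch you can paste into a bash terminal (the filename is one of the workshop's own):

```shell
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz

# strip the .extract.fastq.gz suffix
base=$(basename $filename .extract.fastq.gz)
echo $base      # 0Hour_ATCACG_L002_R1_001

# swap _R1_ for _R2_ to get the mate's filename
baseR2=${base/_R1_/_R2_}
echo $baseR2    # 0Hour_ATCACG_L002_R2_001
```

Note that ${var/pattern/replacement} is a bash feature, not plain POSIX sh.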
Questions:

• how do you figure out if it's working?
  – copy/paste it from Word?
  – put in lots of echo?
  – edit one line at a time?
• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq

Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.qc.fq.gz.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct

Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq

Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
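Since the map is just two tab-separated columns (gene, then transcript), standard UNIX tools can summarize it; for example, counting isoforms per gene. A sketch on a made-up toy map (toy_map.txt and its IDs are illustrative, not workshop data):

```shell
# toy gene/transcript map: gene-id <tab> transcript-id
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > toy_map.txt

# count how many transcripts (isoforms) each gene has
cut -f1 toy_map.txt | sort | uniq -c
```

Run against your real Trinity.fasta.gene_trans_map, this gives a quick feel for how fragmented the assembly is.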
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search Transdecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6

8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

• signal peptides, using signalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.

8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are 2 arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

...and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam

9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
94 Differential Expression
First install R and edgeR
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of several columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
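Once you have the CSV, you can pull out significant genes without opening Excel. A sketch using awk on a made-up file (toy.csv, its column order, and the 0.05 FDR threshold are all illustrative; check the header of your own nema-edgeR.csv before adapting this):

```shell
# toy edgeR-style output: gene,logFC,PValue,FDR
printf 'gene,logFC,PValue,FDR\ngeneA,2.5,0.0001,0.001\ngeneB,0.3,0.4,0.6\n' > toy.csv

# keep genes whose FDR-adjusted P-value is below 0.05, skipping the header
awk -F, 'NR > 1 && $4 < 0.05 { print $1 }' toy.csv
```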
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
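The overall alignment rate is simply aligned reads divided by total reads; a quick sanity check with hypothetical counts (the 97,180 / 100,000 figures below are illustrative, not from the workshop run):

```shell
# hypothetical counts: 97,180 of 100,000 reads aligned
aligned=97180
total=100000
awk -v a=$aligned -v t=$total 'BEGIN { printf "%.2f%% overall alignment rate\n", 100 * a / t }'
```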
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC
To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC
mkdir ~testdatacd ~testdata
wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata
This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do
wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
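Once you start writing that small-data script, two bash settings save a lot of debugging time: 'set -e' stops the script at the first failing command, and 'set -x' echoes each command as it runs. A minimal, hypothetical skeleton (the filenames and the results directory are placeholders, not part of this tutorial's pipeline):

```shell
#!/bin/bash
# Hypothetical small-data pipeline skeleton -- filenames are placeholders.
set -e    # stop at the first failing command
set -x    # echo each command as it runs, so you can see where the script is

mkdir -p results
for r1 in *_R1_*.fastq.gz
do
    base=$(basename "$r1" .fastq.gz)
    echo "processing $base"
    # trimming / mapping / counting commands for $base go here
done
```

Run it on your small subset first; the '-x' trace makes it obvious which command failed and with which filenames.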
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing.
ANGUS - summer NGS course - lots of resources and materials, and book reference.
Data Carpentry - intro to R, etc.
Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options --
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
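A FASTQ record is exactly four lines (header, sequence, '+', qualities), which is why the head count must be a multiple of four. A quick self-contained sanity check, using a throwaway toy file rather than the real data:

```shell
# Build a toy 3-record FASTQ, gzip it, then count the records (lines / 4).
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 | gzip > toy.fq.gz
records=$(( $(gunzip -c toy.fq.gz | wc -l) / 4 ))
echo $records    # prints 3
```

The same line-count-divided-by-four arithmetic tells you how many records any subset you make actually contains.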
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) --
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
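That rule of thumb is simple arithmetic; a sketch with a made-up read count of 500 million (the 500 is an example value, not from this tutorial's data):

```shell
# Back-of-envelope intermediate-disk estimate at ~1 TB per 200 million reads.
awk 'BEGIN { reads_million = 500; printf "%.1f TB\n", reads_million / 200 }'
# prints: 2.5 TB
```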
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used    Available Use% Mounted on
/dev/xvdb       38565344 20098736     16500940  55% /mnt
which tells you that /mnt has roughly 40 GB of disk space.
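df reports sizes in 1K-blocks, so dividing twice by 1024 converts the figure above to gigabytes:

```shell
# Convert df's 1K-blocks to gigabytes (38565344 is the example value above).
echo "$(( 38565344 / 1024 / 1024 )) GB"    # prints: 36 GB
```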
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Questions:
• How do you figure out what the parameters mean?
• How do you figure out what parameters to use?
• What adapters do you use?
• What version of Trimmomatic are we using here? (And FastQC?)
• Do you think parameters are different for RNAseq and genomic data sets?
• What's with these annoyingly long and complicated filenames?
• Why are we running R1 and R2 together?
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
• Trimmomatic
4.7 4. FastQC again
Run FastQC again on the trimmed files:
fastqc 0Hour_ATCACG_L002_R1_001.qc.fq.gz
fastqc 0Hour_ATCACG_L002_R2_001.qc.fq.gz
And now view my copies of these files:
• 0Hour_ATCACG_L002_R1_001.qc.fq_fastqc/fastqc_report.html
• 0Hour_ATCACG_L002_R2_001.qc.fq_fastqc/fastqc_report.html
Let's take a look at the output files:
less 0Hour_ATCACG_L002_R1_001.qc.fq.gz
(again, use spacebar to scroll, 'q' to exit less)
Questions:
• Is the quality-trimmed data "better" than before?
• Does it matter that you still have adapters?
4.8 5. Trim the rest of the sequences
First, download the rest of the data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
And link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.
Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se \
         ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --
• lines starting with '#' are comments;
• anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
• we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
• what's with 'orphans.fq'?
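The filename trick is plain bash parameter substitution: ${variable/pattern/replacement} replaces the first match of the pattern inside the variable's value. A tiny demonstration:

```shell
# Build the R2 filename from the R1 filename with bash substitution.
base=0Hour_ATCACG_L002_R1_001
echo ${base/_R1_/_R2_}    # prints 0Hour_ATCACG_L002_R2_001
```

Note this is a bash feature; it will not work in a plain POSIX sh.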
Questions:
• How do you figure out if it's working?
- copy/paste it from Word
- put in lots of echo
- edit one line at a time
• How on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form, for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step --
0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe.* *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/
cd trinity
make
Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh
cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit
mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu
cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa
cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:
cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
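The map is two tab-separated columns (gene, then transcript), so standard UNIX tools can summarize it. A quick sketch, assuming the Trinity.fasta.gene_trans_map generated above:

```shell
# Count distinct genes (unique values in column 1) ...
cut -f1 Trinity.fasta.gene_trans_map | sort -u | wc -l
# ... and total transcripts (one per line).
wc -l < Trinity.fasta.gene_trans_map
```

The gap between the two numbers is how many extra isoforms Trinity reported beyond one transcript per gene.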
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:
cd
sudo cpan URI::Escape
Note: type yes for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:
(a) search Trinity transcripts:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search TransDecoder-predicted proteins:
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database; rename output files accordingly:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
• signal peptides, using SignalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database
Install Trinotate:
cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:
sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts);
2. Protein sequences (currently as defined by TransDecoder);
3. Gene/Transcript relationships.
~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report
~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:
scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get express:
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, extract the columns we need from the eXpress outputs and convert them to the appropriate format:
labibi Documentation, Release 1.0
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
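If csvtool isn't available, awk can do the same extraction. Here's a minimal sketch on a tiny synthetic results.xprs-style file (the file name and values are made up; the column positions follow the header shown above, where target_id is column 2 and est_counts is column 7):

```shell
# Build a tiny tab-separated file shaped like results.xprs (synthetic data).
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > toy.xprs
printf '2\tnemaid1tr1\t811\t508.137307\t1301\t45\t158.338092\n' >> toy.xprs

# Skip the header line, then print target_id (column 2) and est_counts (column 7).
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' toy.xprs > toy_counts.txt
cat toy_counts.txt
```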
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

* False Discovery Rate
* Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now, run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. Its columns give the gene name, log fold change, P-value, and FDR-adjusted P-value.
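Once you have nema-edgeR.csv, plain shell tools are enough for a quick look. A hedged sketch on a made-up CSV (the column names and order here are assumptions; check your file's header):

```shell
# Synthetic edgeR-style CSV: gene, log fold change, P-value, FDR.
cat > example-edgeR.csv <<EOF
gene,logFC,PValue,FDR
nemaid1tr1,2.1,0.001,0.004
nemaid2tr1,-0.3,0.2,0.31
nemaid3tr1,4.7,0.00002,0.0001
EOF

# Keep the header, then sort the remaining rows numerically by FDR (column 4).
(head -1 example-edgeR.csv; tail -n +2 example-edgeR.csv | sort -t, -k4,4g) > example-sorted.csv
cat example-sorted.csv
```

The most significant genes end up at the top, which is usually what you scan first.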
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URL lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on it, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

* close your terminal window;
* open up a new one and connect into your Amazon machine;
* type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)

14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
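The arithmetic works because FASTQ uses exactly four lines per record, so any multiple of 4 passed to head keeps whole records. A self-contained demo with a throwaway file:

```shell
# Make a tiny FASTQ file with 3 records (4 lines each = 12 lines total).
for i in 1 2 3
do
    printf '@read%s\nACGT\n+\nIIII\n' $i
done > tiny.fq

# head -8 keeps 8 lines = 8/4 = 2 complete records.
head -8 tiny.fq > subset.fq
grep -c '^@read' subset.fq    # prints 2
```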
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) --

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.
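The 40 GB figure comes from the 1K-blocks column: 38565344 KB is about 38.6 GB in decimal units (36 GiB in binary units), which the text rounds to 40. A quick sanity check:

```shell
# df -k reports sizes in 1K (1024-byte) blocks.
blocks=38565344                 # total 1K-blocks on /dev/xvdb, from the df output above
gib=$(( blocks / 1024 / 1024 )) # KB -> MB -> GB, integer division
echo "${gib} GiB"               # prints "36 GiB"
```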
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

And link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
Now we have a lot of files -- and we really don't want to trim each and every one of them by typing in a command for each pair. Here we'll make use of a great feature of the UNIX command line -- the ability to automate such tasks.

Here's a for loop that you can run - we'll walk through what it does while it's running:
rm -f orphans.fq

for filename in *_R1_*.extract.fastq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .extract.fastq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # finally, run Trimmomatic
     TrimmomaticPE ${base}.extract.fastq.gz ${baseR2}.extract.fastq.gz \
         ${base}.qc.fq.gz s1_se ${baseR2}.qc.fq.gz s2_se \
         ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
         LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

     # save the orphans
     cat s1_se s2_se >> orphans.fq
done
Things to mention --

* lines starting with '#' are comments;
* anywhere you see a '$', it is replaced by the value of the variable after it; so e.g. $filename is replaced by each of the files matching *_R1_*.extract.fastq.gz, once for each time through the loop;
* we have to do complicated things to the filenames to get this to work, which is what the ${base/_R1_/_R2_} stuff is about;
* what's with 'orphans.fq'?
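The filename manipulation is easiest to understand in isolation. This sketch uses one hypothetical filename and needs no data files:

```shell
# basename strips any directory prefix and the given suffix.
filename=0Hour_ATCACG_L002_R1_001.extract.fastq.gz
base=$(basename $filename .extract.fastq.gz)
echo $base        # prints 0Hour_ATCACG_L002_R1_001

# ${var/pattern/replacement} substitutes the first match (a bash feature).
baseR2=${base/_R1_/_R2_}
echo $baseR2      # prints 0Hour_ATCACG_L002_R2_001
```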
Questions:

* how do you figure out if it's working?

  - copy/paste it from Word;
  - put in lots of echo;
  - edit one line at a time.

* how on earth do you figure out how to do this?
4.9 6. Interleave the sequences

Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:

sudo pip install -U setuptools
sudo pip install khmer==1.3

Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop...

for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .qc.fq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output
done

gzip orphans.fq
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data -- known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:

cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)

Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz

Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) --

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:

for filename in *.pe.*.keep.abundfilt
do
     extract-paired-reads.py $filename
done

Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz

And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
     base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
     output=${base}.dn.fq.gz
     gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step --

0Hour_ATCACG_L002001.dn.fq.gz    6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz    6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz    6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz    6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz    orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz

Let's remove some of the detritus before moving on:

rm *.pe *.se *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now, we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make

Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
     split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

* The different (i's) that correspond to the same (g) represent isoforms.
* The different (g's) could represent different genes (or parts of genes).
* The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

(Note: type yes for all interactive questions.)

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
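The .outfmt6 files are plain tab-separated text with twelve standard columns; the first two are the query and subject IDs, and the last two are E-value and bit score. A quick way to count hits per transcript, shown here on a tiny made-up file:

```shell
# Three fake BLAST tabular (-outfmt 6) hits: 12 tab-separated columns each.
printf 'nemaid1tr1\tsp|P11111\t98.2\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n'  > toy.outfmt6
printf 'nemaid1tr1\tsp|P22222\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t120\n' >> toy.outfmt6
printf 'nemaid2tr1\tsp|P11111\t95.0\t80\t4\t0\t1\t80\t1\t80\t1e-30\t150\n'     >> toy.outfmt6

# Column 1 is the query ID; count hits per query.
cut -f1 toy.outfmt6 | sort | uniq -c
```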
8.4 Characterization of functional annotation features

1. Identify protein domains. We need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

* signal peptides, using signalP;
* transmembrane regions, using tmHMM;
* rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts);
2. Protein sequences (currently as defined by TransDecoder);
3. Gene/Transcript relationships.

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now, get eXpress:
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
labibi Documentation, Release 1.0
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
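If you want a quick look at the most highly expressed transcripts, you can sort this table on the tpm column. This is a small sketch, not part of the original tutorial; it assumes the column layout shown above (tab-separated, with tpm in column 15) and GNU sort:

```shell
# print the N most highly expressed transcripts from an eXpress results.xprs;
# assumes tab-separated output with tpm in column 15, as in the header above
top_by_tpm() {
    tail -n +2 "$1" | sort -t "$(printf '\t')" -k15,15gr | cut -f2,15 | head -"${2:-5}"
}

# usage:
# top_by_tpm 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs 10
```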
9.4 Differential Expression
First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs, and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
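edgeR assumes that every counts file lists the same transcripts in the same order, so it is worth checking that before moving on. A small helper sketch, not part of the original tutorial (the filenames in the usage line are the ones produced above):

```shell
# print a row count and a checksum of the transcript IDs for each counts file;
# if the checksums differ, the files are not row-for-row comparable
check_counts() {
    for f in "$@"
    do
        echo "$f: $(wc -l < "$f") rows, id checksum $(cut -f1 "$f" | md5sum | cut -d ' ' -f1)"
    done
}

# usage:
# check_counts 0Hour_repl1_counts.txt 0Hour_repl2_counts.txt 6Hour_repl1_counts.txt 6Hour_repl2_counts.txt
```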
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
• False Discovery Rate
• Learn R with Swirl
So, download the script:
cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now, run the differential expression script with:
Rscript diff_exp.R
This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
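If you just want the most significant genes without opening Excel, you can sort the CSV on the FDR column. A hedged sketch, not part of the original tutorial; it assumes FDR is the last of the five columns and that the file has a single header row:

```shell
# show the header plus the N genes with the smallest FDR (assumed to be column 5)
top_de() {
    head -1 "$1"
    tail -n +2 "$1" | sort -t , -k5,5g | head -"${2:-5}"
}

# usage:
# top_de nema-edgeR.csv 10
```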
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:
97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6
URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:
mkdir ~/testdata
cd ~/testdata
wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
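One extra check that is worth doing after any big download: gzip can verify each file's integrity, which catches truncated transfers before they break a pipeline halfway through. A small sketch, not part of the original instructions:

```shell
# gzip -t exits non-zero on a truncated or corrupt file
check_gzips() {
    for f in "$@"
    do
        if gzip -t "$f" 2> /dev/null
        then
            echo "$f OK"
        else
            echo "$f CORRUPT - re-download"
        fi
    done
}

# usage, from the directory you downloaded into:
# check_gzips *.fastq.gz
```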
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
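The tips above can be sketched as a skeleton script: lots of echo, and stop on the first error. The step names and WORKDIR here are placeholders for your own pipeline, not commands from this tutorial:

```shell
#!/bin/bash
set -e    # exit immediately if any command fails
set -u    # treat unset variables as errors

WORKDIR=/mnt/work    # placeholder - adjust to your own working directory

echo "step 1: quality trimming"
# ... trimming commands go here ...

echo "step 2: interleaving"
# ...

echo "step 3: digital normalization"
# ...

echo "pipeline finished"
```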
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options –
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
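For paired-end data, remember to subset R1 and R2 together: both files list mates in the same order, so taking the same number of lines from each keeps the pairing intact. A small sketch; the helper name and output filenames are ours, not part of the tutorial:

```shell
# subset an R1/R2 pair, taking the same number of lines from each
subset_pair() {
    gunzip -c "$1" | head -"$3" | gzip > "sub_$(basename "$1")"
    gunzip -c "$2" | head -"$3" | gzip > "sub_$(basename "$2")"
}

# usage (400,000 lines = 100,000 records from each file):
# subset_pair 0Hour_ATCACG_L002_R1_001.extract.fastq.gz 0Hour_ATCACG_L002_R2_001.extract.fastq.gz 400000
```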
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt
which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
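While a pipeline runs, it helps to watch both the mount's free space and which intermediate files are eating it. A small helper sketch, not part of the tutorial:

```shell
# show free space on a mount, plus its five largest entries
disk_report() {
    df -h "$1"
    du -sh "$1"/* 2> /dev/null | sort -h | tail -5
}

# usage:
# disk_report /mnt
```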
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
– put in lots of echo
– edit one line at a time
• how on earth do you figure out how to do this?
4.9 6. Interleave the sequences
Next, we need to take these R1 and R2 sequences and convert them into interleaved form for the next step. To do this, we'll use scripts from the khmer package, which we need to install:
sudo pip install -U setuptools
sudo pip install khmer==1.3
Now, let's use a for loop again - you might notice this is only a minor modification of the previous for loop:
for filename in *_R1_*.qc.fq.gz
do
     # first, make the base by removing .extract.fastq.gz
     base=$(basename $filename .qc.fq.gz)
     echo $base

     # now, construct the R2 filename by replacing R1 with R2
     baseR2=${base/_R1_/_R2_}
     echo $baseR2

     # construct the output filename
     output=${base/_R1_/}.pe.qc.fq.gz

     (interleave-reads.py ${base}.qc.fq.gz ${baseR2}.qc.fq.gz | gzip > $output)
done

gzip orphans.fq
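A quick way to sanity-check the interleaving: FASTQ stores one record per four lines, so line count divided by four gives the number of reads, and each interleaved file should hold the R1 and R2 counts combined. A small sketch; the helper name is ours, not part of the tutorial:

```shell
# count the reads in a gzipped FASTQ file (4 lines per record)
count_reads() {
    echo $(( $(gunzip -c "$1" | wc -l) / 4 ))
}

# usage:
# for f in *.pe.qc.fq.gz; do echo "$f: $(count_reads "$f") reads"; done
```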
Next: Running digital normalization
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.
Digital normalization works by eliminating high-abundance reads that are unnecessary for assembly.
First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
Next, run diginorm on the orphaned reads (from trimming):
normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –
filter-abund.py -V normC20k20.ct *.keep
See our paper Zhang et al., 2014 (http://www.ncbi.nlm.nih.gov/pubmed/25062443), Table 3, for more information on k-mer trimming effects.
Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
   extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:
cat *.se.* orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:
for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
   base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
   output=${base}.dn.fq.gz
   gzip -c $filename > $output
done
Now, if you type:
ls *.dn.fq.gz
you'll see all of the files that you need to move on to the next step –
0Hour_ATCACG_L002001.dn.fq.gz   6Hour_CGATGT_L002002.dn.fq.gz
0Hour_ATCACG_L002002.dn.fq.gz   6Hour_CGATGT_L002003.dn.fq.gz
0Hour_ATCACG_L002003.dn.fq.gz   6Hour_CGATGT_L002004.dn.fq.gz
0Hour_ATCACG_L002004.dn.fq.gz   6Hour_CGATGT_L002005.dn.fq.gz
0Hour_ATCACG_L002005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002001.dn.fq.gz
Let's remove some of the detritus before moving on:
rm *.pe.* *.se.* *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now, we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.
First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:
cd /mnt/work
for i in *.dn.fq.gz
do
   split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:
~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:
curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:
gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.
Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.
If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships
We can generate this file like so:
cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:
less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:
• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details.
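Since the map is just two tab-separated columns (gene, then transcript), you can count how many genes and transcripts the assembly contains with standard Unix tools. A small sketch, not part of the original tutorial:

```shell
# count distinct genes (column 1) and transcripts (column 2) in the map
count_map() {
    echo "genes: $(cut -f1 "$1" | sort -u | wc -l)"
    echo "transcripts: $(cut -f2 "$1" | sort -u | wc -l)"
}

# usage:
# count_map Trinity.fasta.gene_trans_map
```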
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install TransDecoder to do this job:
cd
sudo cpan URI::Escape
Note: type yes for all interactive questions.
curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the TransDecoder output:
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies
Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases:
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies.
(a) search Trinity transcripts:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search TransDecoder-predicted proteins:
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database, renaming the output files accordingly:
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
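Whichever database you searched, outfmt 6 is tab-separated with the query ID in column 1, so you can summarize how many transcripts got a hit. Note that -max_target_seqs 1 limits each query to one subject but can still emit several HSP lines per query, which is why unique IDs are counted separately. A small sketch, not part of the original tutorial:

```shell
# summarize an outfmt-6 BLAST result: total hit lines and distinct queries
hits_summary() {
    echo "hit lines: $(wc -l < "$1")"
    echo "queries with a hit: $(cut -f1 "$1" | sort -u | wc -l)"
}

# usage:
# hits_summary blastx.outfmt6
```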
8.4 Characterization of functional annotation features
1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:
bull signal peptides using signalP
bull transmembrane regions using tmHMM
bull rRNA transcripts using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:
cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:
sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:
cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships
~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report
~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:
-E <float>: maximum E-value for reporting best blast hit and associated annotations.
--pfam_cutoff <string>:
1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:
scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
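Once you have the report, a quick way to gauge annotation coverage: the report is tab-delimited, and Trinotate writes '.' in a column when there is no annotation, so counting non-'.' entries in an annotation column gives the number of annotated entries. A hedged sketch; check your own header first, since the column position used below is an assumption:

```shell
# count non-empty ('.') entries in column N of the report, skipping the header
annotated_count() {
    tail -n +2 "$1" | cut -f"$2" | grep -c -v '^\.$'
}

# usage: find the column number of interest, then count
# head -1 trinotate_annotation_report.xls | tr '\t' '\n' | cat -n
# annotated_count trinotate_annotation_report.xls 3
```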
32 Chapter 8 Annotation of denovo transcriptome
CHAPTER 9
Quantification and Differential Expression
First make sure yoursquove downloaded all the original raw data
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express
Now get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie
Next, build an index file for your assembly:
labibi Documentation, Release 1.0
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:
bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
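Because results.xprs is plain tab-separated text, you can pull individual fields out of it with standard shell tools. Here's a sketch; the two-line toy file below stands in for a real results.xprs (column positions are taken from the header shown above):

```shell
# Build a tiny tab-separated stand-in for results.xprs (same column order).
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > toy.xprs
printf '2\tnema.id1.tr1\t811\t508.137307\t1301\t45\t158.338092\n' >> toy.xprs

# Print target_id and est_counts (columns 2 and 7), skipping the header.
awk -F'\t' 'NR > 1 { print $2, $7 }' toy.xprs
```

The same awk one-liner works on the real file once you substitute its name.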
9.4 Differential Expression
First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

- False Discovery Rate
- Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now we run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
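If you'd rather filter the spreadsheet from the shell than from Excel, a one-liner will do it. This is a sketch: the toy CSV below stands in for nema-edgeR.csv, and it assumes the FDR-adjusted P-value is the final column:

```shell
# Toy stand-in for nema-edgeR.csv: gene name first, FDR-adjusted P-value last.
cat > toy-edgeR.csv <<'EOF'
gene,logFC,PValue,FDR
nema.id1.tr1,2.31,0.0001,0.003
nema.id2.tr1,0.12,0.4,0.62
EOF

# Keep the header plus any row whose final column (FDR) is below 0.05.
awk -F',' 'NR == 1 || $NF < 0.05' toy-edgeR.csv
```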
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
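That alignment rate is just the fraction of reads whose SAM record does not have the "unmapped" flag bit (0x4) set, so you can recompute it from any SAM file yourself. A sketch, using a whitespace-separated three-record toy file in place of real bowtie2 output (real SAM is tab-separated):

```shell
# Three toy alignment records; column 2 is the SAM flag (bit 0x4 = unmapped).
cat > toy.sam <<'EOF'
read1 0 contig1 100 42 50M * 0 0 ACGT IIII
read2 4 * 0 0 * * 0 0 ACGT IIII
read3 16 contig1 200 42 50M * 0 0 ACGT IIII
EOF

# Skip header lines (^@); count records whose unmapped bit is NOT set.
awk '!/^@/ { total++; if (int($2 / 4) % 2 == 0) mapped++ }
     END { printf "%.2f%% overall alignment rate\n", 100 * mapped / total }' toy.sam
# -> 66.67% overall alignment rate (2 of the 3 toy reads are mapped)
```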
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
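When you reach the qsub stage, a skeleton like the following can help; the queue directives, module names, and resource requests below are placeholders that you'd replace with your own cluster's values:

```shell
# Write a skeleton qsub script; every directive here is a placeholder example.
cat > run-pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=04:00:00
#PBS -N test-pipeline

set -e                             # stop on the first failed command
cd $PBS_O_WORKDIR                  # run from the submission directory
# module load fastqc trimmomatic   # placeholder: your cluster's modules

# ... the commands you already tested interactively go here ...
EOF

# Check the script for shell syntax errors before submitting it.
bash -n run-pipeline.qsub && echo "syntax OK"
```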
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
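You can double-check a subset the same way: each FASTQ record is exactly four lines, so dividing the line count by 4 gives the record count. Shown here on a tiny generated file rather than the real data:

```shell
# Make a tiny FASTQ (3 records = 12 lines) standing in for real data.
for i in 1 2 3
do
    printf '@read%s\nACGT\n+\nIIII\n' $i
done | gzip > sample.fq.gz

# Each FASTQ record is exactly 4 lines, so lines/4 gives the record count.
lines=$(gunzip -c sample.fq.gz | wc -l)
echo $((lines / 4)) records
# -> 3 records
```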
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –
1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this
Filesystem     1K-blocks      Used  Available Use% Mounted on
/dev/xvdb       38565344  20098736   16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
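The '1K-blocks' column is in 1-kilobyte units, so converting it to GB is simple shell arithmetic:

```shell
# 1K-blocks reported by df; multiply by 1024 to get bytes, then scale to GB.
blocks=38565344
echo "$((blocks * 1024 / 1000000000)) GB total on /mnt"
# -> 39 GB total on /mnt
```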
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 5
Running digital normalization
Next, we're going to apply abundance normalization to the data – known as "digital normalization", this approach was developed by our lab to make it possible to assemble large data sets more quickly and easily. You can read more about it in Brown et al., 2012, and also see some of its effects on transcriptome assembly in Lowe et al., 2014.

Digital normalization works by eliminating high abundance reads that are unnecessary for assembly.

First, we'll run it on the interleaved files we generated in the previous section:
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
(These parameters should work for essentially all mRNAseq data sets; see the khmer documentation for more information.)
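As a rough memory sanity check: normalize-by-median.py holds -N hash tables of about -x bytes each in RAM, so -N 4 -x 2e9 implies something on the order of 8 GB. This is back-of-envelope arithmetic, not an exact figure:

```shell
# Approximate khmer counting-table memory: N tables of x bytes each.
N=4
x=2000000000   # 2e9 bytes per table
echo "approx $((N * x / 1000000000)) GB RAM"
# -> approx 8 GB RAM
```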
Next, run diginorm on the orphaned reads (from trimming):

normalize-by-median.py -l normC20k20.ct -s normC20k20.ct orphans.fq.gz
Do k-mer abundance trimming on the reads, which will eliminate the majority of the errors (thus further decreasing the memory requirements) –

filter-abund.py -V normC20k20.ct *.keep

See our paper Zhang et al., 2014 <http://www.ncbi.nlm.nih.gov/pubmed/25062443>, Table 3, for more information on k-mer trimming effects.

Now, take all of the paired-end files and split them into paired and orphaned reads:
for filename in *.pe.*.keep.abundfilt
do
    extract-paired-reads.py $filename
done
Put all the orphaned reads in one place:

cat *.se orphans.fq.gz.keep.abundfilt | gzip > orphans.dn.fq.gz
And now, rename the paired-end files to something nice:

for filename in *.pe.qc.fq.gz.keep.abundfilt.pe
do
    base=$(basename $filename .pe.qc.fq.gz.keep.abundfilt.pe)
    output=$base.dn.fq.gz
    gzip -c $filename > $output
done
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002.001.dn.fq.gz   6Hour_CGATGT_L002.002.dn.fq.gz
0Hour_ATCACG_L002.002.dn.fq.gz   6Hour_CGATGT_L002.003.dn.fq.gz
0Hour_ATCACG_L002.003.dn.fq.gz   6Hour_CGATGT_L002.004.dn.fq.gz
0Hour_ATCACG_L002.004.dn.fq.gz   6Hour_CGATGT_L002.005.dn.fq.gz
0Hour_ATCACG_L002.005.dn.fq.gz   orphans.dn.fq.gz
6Hour_CGATGT_L002.001.dn.fq.gz
Let's remove some of the detritus before moving on:

rm *.pe.* *.se.* *.abundfilt *.keep
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity
cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
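Before handing left.fq and right.fq to Trinity, it's worth checking that they contain the same number of records, since Trinity expects matched pairs. A sketch of the check, shown here on two toy files (FASTQ records are four lines each):

```shell
# Toy left/right files with two matched reads each, standing in for the real ones.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n' > left.fq
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nTGCA\n+\nIIII\n' > right.fq

# Divide line counts by 4 to get read counts, and require them to match.
left=$(( $(wc -l < left.fq) / 4 ))
right=$(( $(wc -l < right.fq) / 4 ))
[ "$left" -eq "$right" ] && echo "paired files match: $left reads each"
```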
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check
This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz
This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

- The different (i's) that correspond to the same (g) represent isoforms.
- The different (g's) could represent different genes (or parts of genes).
- The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
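Since the map is just two tab-separated columns (gene, then transcript), counting isoforms per gene is a one-liner. A sketch, using a toy map with made-up names in place of the real Trinity.fasta.gene_trans_map:

```shell
# Toy gene/transcript map: column 1 is the gene, column 2 the isoform.
printf 'c0_g1\tc0_g1_i1\nc0_g1\tc0_g1_i2\nc1_g1\tc1_g1_i1\n' > toy.gene_trans_map

# Count isoforms per gene.
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' toy.gene_trans_map | sort
# -> c0_g1 2
#    c1_g1 1
```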
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies
(a) search Trinity transcripts
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search Transdecoder-predicted proteins
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

- signal peptides, using signalP
- transmembrane regions, using tmHMM
- rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
32 Chapter 8 Annotation of denovo transcriptome
CHAPTER 9
Quantification and Differential Expression
First make sure yoursquove downloaded all the original raw data
cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L 
httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz
and link it in
cd mntworkln -fs mntdatafastqgz
91 Download Express
Now get express
cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz
92 Align Reads with Bowtie
Next build an index file for your assembly
33
labibi Documentation Release 10
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nema.id7.tr4   269   0.000000     0     0     0.000000     0.000000      0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nema.id1.tr1   811   508.137307   1301  45    158.338092   252.711602    4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nema.id2.tr1   790   487.144836   1845  356   1218.927626  1976.727972   1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nema.id3.tr1   852   549.122606   1792  3     871.770849   1352.610064   5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nema.id4.tr1   675   372.190166   1005  20    88.963433    161.343106    2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nema.id62.tr13 2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nema.id63.tr13 406   103.720396   360   270   271.097003   1061.173959   1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nema.id61.tr13 447   144.526787   6     0     0.000000     0.000000      2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nema.id21.tr8  2075  1771.684102  2782  58    958.636395   1122.756883   1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
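If you want to check these numbers programmatically rather than eyeball them, awk can total a column. This is a sketch on a hypothetical two-row miniature of results.xprs, trimmed to its first seven columns so that est_counts is field 7, as it is in the full file:

```shell
# Hypothetical miniature results.xprs: first seven columns only,
# tab-separated, so est_counts is field 7
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > mini.xprs
printf '2\tnema.id1.tr1\t811\t508.14\t1301\t45\t158.34\n' >> mini.xprs
printf '2\tnema.id2.tr1\t790\t487.14\t1845\t356\t1218.93\n' >> mini.xprs
# Sum est_counts across all transcripts, skipping the header line
awk -F'\t' 'NR > 1 { sum += $7 } END { printf "%.2f\n", sum }' mini.xprs
```

The same one-liner, pointed at a real results.xprs, gives you the total estimated counts for a sample.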
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

• False Discovery Rate
• Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
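Once you have the CSV, a one-liner can pull out the genes that pass an FDR threshold. This sketch runs on a fabricated two-row CSV; the header names logFC, logCPM, PValue, and FDR are assumptions following edgeR's usual conventions, not taken from the actual output file:

```shell
# Hypothetical miniature of an edgeR-style output CSV (five columns)
printf 'gene,logFC,logCPM,PValue,FDR\n' > mini-edgeR.csv
printf 'nema.id1.tr1,2.1,5.3,0.0001,0.0010\n' >> mini-edgeR.csv
printf 'nema.id2.tr1,-0.2,3.1,0.4000,0.6100\n' >> mini-edgeR.csv
# Print the names of genes whose FDR-adjusted P-value is below 0.05
awk -F, 'NR > 1 && $5 < 0.05 { print $1 }' mini-edgeR.csv
```

Adjust the field number if your real CSV puts FDR in a different column.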
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
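bowtie2 prints that summary to stderr, so if you want the number in a file rather than on screen, add '2> bt2.log' to the command above and grep it out afterwards. A sketch using a canned, hypothetical log line in place of a real bowtie2 run:

```shell
# Stand-in for a bowtie2 stderr log (a real run would produce this
# via: ... | bowtie2 -U - -x transcriptome > /dev/null 2> bt2.log)
printf '100000 reads; of these:\n97.18%% overall alignment rate\n' > bt2.log
# Pull out just the percentage from the summary line
grep 'overall alignment rate' bt2.log | awk '{ print $1 }'
```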
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
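The combined form is just the username and password spliced into the URL, in the pattern ftp://user:pass@host/. A quick sketch of the construction, using the example credentials above:

```shell
# Build the ftp://user:pass@host/ form from the pieces in the e-mail
user=rnaseqmodel
pass=QecheJa6
host=titan.bch.msu.edu
echo "ftp://${user}:${pass}@${host}/"
```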
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.

14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
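Remember that a FASTQ record is exactly four lines, so the number you pass to head should be a multiple of 4; and if your data is paired-end, subset the R1 and R2 files with the same head count so they stay in sync. A self-contained sketch (with made-up read names) that subsets a tiny generated file and counts the records kept:

```shell
# Generate a tiny 3-record FASTQ file (names here are arbitrary)
for i in 1 2 3; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > tiny.fq
# Keep the first 8 lines = the first 2 complete records
head -8 tiny.fq > subset.fq
# Lines divided by 4 gives the number of records kept
echo $(( $(wc -l < subset.fq) / 4 ))
```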
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
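That rule of thumb is just a ratio, so you can scale it to your own run. For example, for a hypothetical 500 million reads:

```shell
# Back-of-the-envelope: intermediate disk for 500 million reads,
# at roughly 1 TB per 200 million reads (the rule of thumb above)
awk 'BEGIN { printf "%.1f TB\n", 500 / 200 }'
```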
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736  16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.
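Since df reports 1K blocks, you can convert its number to gigabytes directly; plugging in the 1K-blocks figure from the output above gives roughly 37 GB of formatted space on the nominal 40 GB volume:

```shell
# Convert df's 1K-block count to gigabytes (1 GB = 1024 * 1024 KB)
awk 'BEGIN { printf "%.0f GB\n", 38565344 / (1024 * 1024) }'
```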
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Now, if you type:

ls *.dn.fq.gz

you'll see all of the files that you need to move on to the next step –

0Hour_ATCACG_L002_001.dn.fq.gz  6Hour_CGATGT_L002_002.dn.fq.gz
0Hour_ATCACG_L002_002.dn.fq.gz  6Hour_CGATGT_L002_003.dn.fq.gz
0Hour_ATCACG_L002_003.dn.fq.gz  6Hour_CGATGT_L002_004.dn.fq.gz
0Hour_ATCACG_L002_004.dn.fq.gz  6Hour_CGATGT_L002_005.dn.fq.gz
0Hour_ATCACG_L002_005.dn.fq.gz  orphans.dn.fq.gz
6Hour_CGATGT_L002_001.dn.fq.gz

Let's remove some of the detritus before moving on:
rm *.pe.* *.se.* *.abundfilt.* *.keep.*
rm normC20k20.ct
Next: Running the actual assembly
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:

sudo apt-get -y install bowtie samtools

Next, install Trinity v2.0.6:
cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity/

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta

Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
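A quick way to see how many sequences ended up in that file is to count the FASTA header lines. The sketch below builds a tiny demo file so it is self-contained, but the same pipeline works on trinity-nematostella-raw.renamed.fasta.gz:

```shell
# Make a tiny gzipped FASTA with two made-up sequences
printf '>seq1\nACGT\n>seq2\nGGCC\n' | gzip > demo.fa.gz
# Count sequences by counting '>' header lines
gunzip -c demo.fa.gz | grep -c '^>'
```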
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo ./blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python ./check.py

Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• The component (TR|c...) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
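Because the map is just two tab-separated columns (gene, then transcript), you can count isoforms per gene with awk. The IDs in this sketch are made up, but they follow the TR|c..._g..._i... naming pattern:

```shell
# Fabricated gene<TAB>transcript map: gene 1 has two isoforms
printf 'TR1|c0_g1\tTR1|c0_g1_i1\n' >  demo.gene_trans_map
printf 'TR1|c0_g1\tTR1|c0_g1_i2\n' >> demo.gene_trans_map
printf 'TR2|c0_g1\tTR2|c0_g1_i1\n' >> demo.gene_trans_map
# Count transcripts (isoforms) per gene, then sort for stable output
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' demo.gene_trans_map | sort
```

Run the same one-liner on Trinity.fasta.gene_trans_map to see how isoform-rich your assembly is.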
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptide:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep

8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
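With -max_target_seqs 1 you get at most one subject per query, but a query can still produce several HSP lines. BLAST's tabular output lists each query's hits best-first, so keeping only the first line seen per query ID gives the top hit. A sketch on fabricated outfmt 6 rows (12 tab-separated columns, query and subject IDs made up):

```shell
# Fabricated outfmt 6 rows: q1 has two hits (best first), q2 has one
printf 'q1\tsp|P11111|A\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n' >  demo.outfmt6
printf 'q1\tsp|P22222|B\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90\n'  >> demo.outfmt6
printf 'q2\tsp|P33333|C\t95.0\t50\t2\t0\t1\t50\t1\t50\t1e-30\t120\n'     >> demo.outfmt6
# Keep only the first (best) line per query ID; print query and subject
awk -F'\t' '!seen[$1]++ { print $1, $2 }' demo.outfmt6
```

Pointed at blastx.outfmt6 or blastp.outfmt6, this gives one best-hit line per transcript or peptide.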
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features like:

• signal peptides, using signalP;
• transmembrane regions, using tmHMM;
• rRNA transcripts, using RNAMMER.
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts);

2. Protein sequences (currently as defined by TransDecoder);

3. Gene/Transcript relationships.

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First make sure yoursquove downloaded all the original raw data
cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L 
httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz
and link it in
cd mntworkln -fs mntdatafastqgz
91 Download Express
Now get express
cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz
92 Align Reads with Bowtie
Next build an index file for your assembly
33
labibi Documentation Release 10
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed
Using the index we built wersquoll align the reads from a few of our samples back to our assembly
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam
93 Quantify Expression using eXpress
Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)
~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam
~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam
This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command
head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs
You should see something like this
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03
9.4 Differential Expression

First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
labibi Documentation, Release 1.0
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links
bull False Discovery Rate
bull Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R

This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns, including gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
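To peek at the most significant genes without opening Excel, you can sort the CSV by its FDR column. The snippet below is a sketch: it assumes the FDR-adjusted P-value is the fourth column and that the file has a header row — check the header of your own nema-edgeR.csv first. The three-gene mock file and its numbers are made up for illustration.

```shell
# Mock edgeR output: gene, log fold change, P-value, FDR (made-up numbers).
printf 'gene,logFC,PValue,FDR\ngeneA,2.1,1e-3,1e-2\ngeneB,-1.2,0.2,0.4\ngeneC,3.3,1e-8,1e-6\n' > mock-edgeR.csv

# Keep the header, then sort the data rows numerically on column 4;
# 'sort -g' understands scientific notation like 1e-6.
(head -1 mock-edgeR.csv; tail -n +2 mock-edgeR.csv | sort -t, -g -k4) > sorted.csv
head sorted.csv
```

Here geneC (FDR 1e-6) sorts to the top of the data rows, ahead of geneA and geneB.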
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
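bowtie2 writes that summary to stderr rather than stdout, so to capture the rate in a script you need to redirect stderr. A minimal sketch — the echo below just stands in for a real bowtie2 run:

```shell
# In a real run, capture the summary with:  bowtie2 ... > /dev/null 2> bowtie2.log
# Here an echo stands in for bowtie2's stderr output.
echo "97.18% overall alignment rate" > bowtie2.log

# Pull out just the percentage.
grep 'overall alignment rate' bowtie2.log | awk '{ print $1 }'
```

This prints 97.18%, which is handy for logging the mapping rate of each sample.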
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,
• close your terminal window
• open up a new one and connect into your Amazon machine
• type 'screen -r' to reconnect into your running screen

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
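The 4:1 ratio holds because a FASTQ record is exactly four lines (header, sequence, '+', qualities), so you can verify any subset's record count by dividing its line count by four. A small sketch on a generated three-record mock file — for the real subset you would feed `gunzip -c sample.fq.gz` into the same awk command:

```shell
# Build a 3-record mock FASTQ (12 lines).
for i in 1 2 3
do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > mock.fq

# Line count divided by 4 = record count.
awk 'END { print NR / 4 }' mock.fq
```

This prints 3 for the mock file; on sample.fq.gz it should print 100000.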
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
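The rule of thumb in item 3 is easy to turn into arithmetic when sizing an instance. The read count below is a made-up example, not from the workshop data:

```shell
# ~1 TB of intermediate space per 200 million reads (rule of thumb above).
READS=350000000   # hypothetical project total

awk -v r="$READS" 'BEGIN { printf "%.2f TB\n", r / 200000000 }'
```

350 million reads works out to 1.75 TB, so you would provision a 2 TB volume to be safe.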
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt
You should see something like this
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
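The "40 GB" figure comes straight from the 1K-blocks column: 'df -k' reports kilobyte blocks, so multiplying by 1024 converts to bytes. A sketch using the df data line shown above:

```shell
# The df data line from above, fed through awk to convert 1K blocks to GB.
echo "/dev/xvdb 38565344 20098736 16500940 55% /mnt" |
    awk '{ printf "%.1f GB total, %.1f GB available\n", $2 * 1024 / 1e9, $4 * 1024 / 1e9 }'
```

This prints "39.5 GB total, 16.9 GB available" — i.e. roughly 40 GB, with about half of it already used.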
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 6
Running the actual assembly
Now we'll assemble all of these reads into a transcriptome, using the Trinity de novo transcriptome assembler.

First, install some prerequisites for Trinity:
sudo apt-get -y install bowtie samtools
Next, install Trinity v2.0.6:

cd
curl -L https://github.com/trinityrnaseq/trinityrnaseq/archive/v2.0.6.tar.gz > trinity.tar.gz
tar xzf trinity.tar.gz
mv trinityrnaseq* trinity

cd trinity
make
Go into the work directory, and prepare the data:

cd /mnt/work
for i in *.dn.fq.gz
do
    split-paired-reads.py $i
done

cat *.1 > left.fq
cat *.2 > right.fq
Now, run the Trinity assembler:

~/trinity/Trinity --left left.fq --right right.fq --seqType fq --max_memory 10G --bypass_java_version_check

This will give you an output file trinity_out_dir/Trinity.fasta, which you can get stats on like so:

curl -L -O https://github.com/ged-lab/khmer/raw/v1.3/sandbox/assemstats3.py
python assemstats3.py 300 trinity_out_dir/Trinity.fasta
Change the filename and rename all the sequences:

gzip -c trinity_out_dir/Trinity.fasta > trinity-nematostella-raw.fa.gz
curl -O http://2015-may-nonmodel.readthedocs.org/en/dev/_static/rename-with-partitions.py
chmod u+x rename-with-partitions.py
./rename-with-partitions.py nema trinity-nematostella-raw.fa.gz

This last command will give you trinity-nematostella-raw.renamed.fasta.gz, which contains all of the renamed sequences.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now, copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of denovo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map
Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
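Since the map is two tab-separated columns (gene ID, then transcript ID), counting repeated gene IDs gives an isoform count per gene. A sketch on a made-up three-line map — point cut at Trinity.fasta.gene_trans_map for the real thing; the IDs below are invented:

```shell
# Mock gene_trans_map: gene-ID <TAB> transcript-ID, one line per isoform.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > mock.gene_trans_map

# Count transcripts per gene, most isoform-rich genes first.
cut -f1 mock.gene_trans_map | sort | uniq -c | sort -rn
```

Here TR1|c0_g1 comes out with 2 isoforms and TR2|c0_g1 with 1; on a real de novo assembly, a handful of genes with very many "isoforms" is a common sight.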
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases:

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database, renaming the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log

2. We can predict other features, like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command
head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs
You should see something like this
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03
94 Differential Expression
First install R and edgeR
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format
34 Chapter 9 Quantification and Differential Expression
labibi Documentation Release 10
csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt
Wersquoll be using edgeR to do the basic differential expression analysis of our counts
To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it
Links
bull False Discovery Rate
bull Learn R with Swirl
So download the script
cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR
Now we run the differential expression script with
Rscript diff_expR
This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value
You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf
94 Differential Expression 35
labibi Documentation Release 10
36 Chapter 9 Quantification and Differential Expression
CHAPTER 10
Remapping your reads to your assembled transcriptome
First wersquoll need to make sure bowtie2 is installed
sudo apt-get install -y bowtie2
Now create a bowtie2 index out of your transcriptome
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome
And then finally count the number of reads that map to your transcriptome
zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull
You should get something like
9718 overall alignment rate
37
labibi Documentation Release 10
38 Chapter 10 Remapping your reads to your assembled transcriptome
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget', for example, on the HPC:
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
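The qsub step usually ends up being your tested commands plus a handful of scheduler directives. A minimal sketch: the resource limits and module names below are hypothetical, so adjust them to your cluster's conventions.

```shell
# Write out a skeleton qsub script (all #PBS values and module names are
# examples only -- substitute your cluster's conventions).
cat > run_pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l mem=16gb
#PBS -l nodes=1:ppn=4

# qsub starts jobs in your home directory; move to the submission directory.
cd ${PBS_O_WORKDIR}

# Load the modules your commands need (names vary by cluster).
module load FastQC
module load Trimmomatic

# ... the commands you already tested interactively go here ...
EOF
```

You would then submit it with something like `qsub run_pipeline.qsub`.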
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options -

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
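Because every FASTQ record is exactly four lines, any head count that is a multiple of four keeps whole records. A tiny self-contained check (the reads here are fabricated):

```shell
# Build a fake 3-record FASTQ, keep the first 2 records (8 lines), and count.
for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > tiny.fq

head -8 tiny.fq > subset.fq
echo "$(( $(wc -l < subset.fq) / 4 )) records kept"
# -> 2 records kept
```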
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine):

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
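That rule of thumb (roughly 1 TB per 200 million reads) is easy to turn into a quick estimate before you provision a disk, for example with awk; the 500-million-read figure below is just an example project size.

```shell
# Estimate intermediate disk needed for a project, using the rule of thumb
# of ~1 TB per 200 million reads.
awk -v reads_m=500 'BEGIN { printf "%.1f TB for %d million reads\n", reads_m / 200, reads_m }'
# -> 2.5 TB for 500 million reads
```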
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine - you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt
You should see something like this
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvdb       38565344 20098736  16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.
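Since df -k reports 1K blocks, dividing by 1048576 (1024*1024) converts those columns to GB. A sketch that parses a captured df line (the sample output is pasted in so the pipeline runs anywhere; on your own instance, pipe `df -k /mnt` into the awk step instead):

```shell
# Convert df -k's 1K-block counts to GB; the sample line mirrors the
# m3.xlarge output shown above.
printf '%s\n' 'Filesystem 1K-blocks Used Available Use% Mounted on' \
              '/dev/xvdb 38565344 20098736 16500940 55% /mnt' |
    awk 'NR > 1 { printf "%s: %.1f GB total, %.1f GB free\n", $6, $2 / 1048576, $4 / 1048576 }'
# -> /mnt: 36.8 GB total, 15.7 GB free
```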
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
Next: BLASTing your assembled data
CHAPTER 7
BLASTing your assembled data
First install a few prerequisites
sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev
Next grab things needed for the BLAST server
sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed

sudo ln -s /usr/bin/blastall /usr/local/bin/
Install the BLAST server and configure it
cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py
Now copy in your newly created transcriptome
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa
You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access on your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

• The different (i's) that correspond to the same (g) represent isoforms.
• The different (g's) could represent different genes (or parts of genes).
• The component (TR|c) often contains related genes (paralogs or gene fragments).
Check the Trinityseq forum for more details
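The gene_trans_map file itself is just two tab-separated columns (gene, transcript), so you can count isoforms per gene with a one-line awk tally. A runnable sketch; the three map lines below are fabricated, Trinity-style identifiers:

```shell
# Fake gene/transcript map: gene TR1|c0_g1 has two isoforms, TR2|c0_g1 has one.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > demo.gene_trans_map

# Tally transcripts per gene (column 1 is the gene id).
awk -F'\t' '{ n[$1]++ } END { for (g in n) print g, n[g] }' demo.gene_trans_map | sort
# -> TR1|c0_g1 2
# -> TR2|c0_g1 1
```

Run on your real Trinity.fasta.gene_trans_map, this shows which "genes" carry unusually many isoforms.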
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly
We need to install Transdecoder to do this job:

cd
sudo cpan URI::Escape

Note: type yes for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder* TransDecoder
cd TransDecoder
make
Now we can run the Transdecoder software to identify the longest-ORF peptide
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
Check the Transdecoder output
less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases
1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run blast to find homologies
(a) search Trinity transcripts
blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
(b) search Transdecoder-predicted proteins
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: Uniref90, which provides clustered sets of protein sequences in a way such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot
Perform similar searches using uniref90 as the target database; rename output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
8.4 Characterization of functional annotation features

1. Identify protein domains: we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

• signal peptides, using signalP
• transmembrane regions, using tmHMM
• rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database
Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI Perl package:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate Pre-generated Resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in
cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .

9.1 Download Express
Now get express
cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz

9.2 Align Reads with Bowtie
Next build an index file for your assembly
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
9.4 Differential Expression
First install R and edgeR
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
CHAPTER 7
BLASTing your assembled data
First, install a few prerequisites:

sudo apt-get -y install lighttpd blast2 git-core zlib1g-dev

Next, grab things needed for the BLAST server:

sudo pip install pygr whoosh Pillow Jinja2 git+https://github.com/ctb/pygr-draw.git screed
sudo ln -s /usr/bin/blastall /usr/local/bin/

Install the BLAST server and configure it:

cd
git clone https://github.com/ctb/blastkit.git -b 2015-may-nonmodel
sudo blastkit/configure-lighttpd.sh

cd blastkit/www
sudo ln -fs $PWD /var/www/blastkit

mkdir files
chmod a+rxwt files
chmod +x /home/ubuntu

cd /home/ubuntu/blastkit
python check.py

Now copy in your newly created transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > /home/ubuntu/blastkit/db/db.fa

cd /home/ubuntu/blastkit
formatdb -i db/db.fa -o T -p F
python index-db.py db/db.fa

You can now access your BLAST server at http://<amazon machine name>/blastkit/.

Note that you will need to enable HTTP access in your Amazon firewall settings; see amazon/enable-http.

If you want to use something that will get you results with the test data set, try BLASTing a zebrafish lethal giant larvae homolog.
CHAPTER 8
Annotation of de novo transcriptome

8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

cd /mnt/work
~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

less Trinity.fasta.gene_trans_map

Components, genes, and isoforms:

- The different i's that correspond to the same g represent isoforms.
- The different g's could represent different genes (or parts of genes).
- The components (TR|c...) often contain related genes (paralogs or gene fragments).

Check the Trinityseq forum for more details.
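If you want to poke at this mapping programmatically, the file is just two tab-separated columns (gene, transcript), which is easy to parse. A minimal Python sketch; the helper name load_gene_trans_map is ours, not part of Trinity:

```python
from collections import defaultdict

def load_gene_trans_map(path):
    """Parse Trinity's two-column gene<TAB>transcript map into gene -> [isoforms]."""
    gene_to_isoforms = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            gene, transcript = line.split("\t")
            gene_to_isoforms[gene].append(transcript)
    return dict(gene_to_isoforms)
```

For example, sum(1 for iso in m.values() if len(iso) > 1) then counts how many "genes" have more than one isoform.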
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

cd
sudo cpan URI::Escape

Note: type "yes" for all interactive questions.

curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
tar xzf transdecoder.tar.gz
mv TransDecoder-2.0.1 TransDecoder
cd TransDecoder
make

Now we can run the TransDecoder software to identify the longest-ORF peptides:
cd /mnt/work
~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):

sudo apt-get install -y ncbi-blast+

Get the required sequence databases and prepare local blast databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins.

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
gunzip uniprot_sprot.trinotate.pep.gz
makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot

Run blast to find homologies:

(a) search Trinity transcripts:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6

2. Optional: Uniref90, which provides clustered sets of protein sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence.

wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
gunzip uniprot_uniref90.trinotate.pep.gz
makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using uniref90 as the target database; rename the output files accordingly:

blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

I have already run them overnight. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
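The .outfmt6 files are plain tab-separated text with the 12 standard BLAST tabular columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Even with -max_target_seqs 1, a query can appear on several lines (one per HSP), so a small sketch like this (our own helper, not part of BLAST or Trinotate) is handy for collapsing to one best hit per transcript:

```python
def best_hits(path):
    """Keep the single top-scoring subject per query from a BLAST -outfmt 6 file.

    Columns: qseqid sseqid pident length mismatch gapopen
             qstart qend sstart send evalue bitscore
    """
    hits = {}
    with open(path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            query, subject, bitscore = cols[0], cols[1], float(cols[11])
            # keep the hit with the highest bitscore seen so far
            if query not in hits or bitscore > hits[query][1]:
                hits[query] = (subject, bitscore)
    return hits
```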
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

- signal peptides, using signalP
- transmembrane regions, using tmHMM
- rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate-2.0.2 Trinotate

Install sqlite:

sudo apt-get install sqlite3

We also need the DBI and DBD::SQLite Perl packages:

sudo cpan DBI
sudo cpan DBD::SQLite

Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz

Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep

Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load Uniref90 blast hits:
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting best blast hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let's see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz

and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download eXpress

Now get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

bundle_id  target_id     length  eff_length   tot_counts  uniq_counts  est_counts   eff_counts    ambig_distr_alpha  ambig_distr_beta  fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
1          nemaid7tr4    269     0.000000     0           0            0.000000     0.000000      0.000000e+00       0.000000e+00      0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
2          nemaid1tr1    811     508.137307   1301        45           158.338092   252.711602    4.777128e+01       4.816246e+02      3.073997e+03  2.311142e+03   3.836852e+03    T         4.695471e+03
2          nemaid2tr1    790     487.144836   1845        356          1218.927626  1976.727972   1.111471e+02       8.063959e+01      2.468419e+04  2.254229e+04   2.682610e+04    T         3.770463e+04
2          nemaid3tr1    852     549.122606   1792        3            871.770849   1352.610064   5.493335e+01       5.818711e+01      1.566146e+04  1.375746e+04   1.756546e+04    T         2.392257e+04
2          nemaid4tr1    675     372.190166   1005        20           88.963433    161.343106    2.836182e+01       3.767281e+02      2.358011e+03  1.546107e+03   3.169914e+03    T         3.601816e+03
3          nemaid62tr13  2150    1846.657210  9921        9825         9919.902997  11549.404689  1.704940e+03       1.970774e+01      5.299321e+04  5.281041e+04   5.317602e+04    T         8.094611e+04
3          nemaid63tr13  406     103.720396   360         270          271.097003   1061.173959   1.934732e+02       1.567940e+04      2.578456e+04  2.417706e+04   2.739205e+04    T         3.938541e+04
3          nemaid61tr13  447     144.526787   6           0            0.000000     0.000000      2.246567e+04       2.246565e+10      3.518941e-08  0.000000e+00   1.296989e-03    T         5.375114e-08
4          nemaid21tr8   2075    1771.684102  2782        58           958.636395   1122.756883   1.223148e+02       2.476298e+02      5.337855e+03  4.749180e+03   5.926529e+03    T         8.153470e+03
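Note that results.xprs is itself just a tab-separated table with a header row, so you can also pull out target_id/est_counts pairs in plain Python; this small sketch (the function name is ours) mirrors what the csvtool commands in the next section extract:

```python
import csv

def est_counts(path):
    """Read an eXpress results.xprs file (tab-separated, with header)
    into a {target_id: est_counts} dict."""
    counts = {}
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row["target_id"]] = float(row["est_counts"])
    return counts
```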
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one: diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.

Links:

- False Discovery Rate
- Learn R with Swirl

So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
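If you would rather filter the spreadsheet programmatically than in Excel, a short sketch like this works. We are assuming edgeR-style column names (logFC, PValue, FDR) in the CSV header here, so check the header of your own nema-edgeR.csv first:

```python
import csv

def significant_genes(path, fdr_cutoff=0.05):
    """Collect rows whose FDR-adjusted P-value passes the cutoff.

    Assumes edgeR-style column names in the CSV header (logFC, PValue, FDR).
    """
    with open(path) as fh:
        return [row for row in csv.DictReader(fh)
                if float(row["FDR"]) < fdr_cutoff]
```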
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2

Now create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome

And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
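If you run this check across many samples, it can be convenient to scrape that number out of bowtie2's summary text automatically; a small sketch (our helper, not part of bowtie2):

```python
import re

def overall_alignment_rate(summary_text):
    """Extract the percentage from bowtie2's 'NN.NN% overall alignment rate'
    summary line; returns None if the line is absent."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    return float(match.group(1)) if match else None
```

(bowtie2 prints this summary on stderr, so capture that stream when running it from a script.)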
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.

11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
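The combined link follows the standard ftp://username:password@hostname/path scheme. A tiny sketch of how the pieces fit together (the helper name is ours; note that a password containing special characters like '@' or ':' would need percent-encoding first):

```python
def ftp_url(username, password, hostname, path=""):
    """Combine FTP credentials into a single browsable link:
    ftp://username:password@hostname/path"""
    return "ftp://%s:%s@%s/%s" % (username, password, hostname,
                                  path.lstrip("/"))
```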
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget'; for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and a book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.

12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).

2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms, but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. All the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.

3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.

4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions; the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz

The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
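The same subsetting can be done in Python if you need it inside a script; this sketch (helper name ours) copies the first N records, using the fact that each FASTQ record is exactly four lines:

```python
import gzip
from itertools import islice

def subsample_fastq(src, dest, n_records=100000):
    """Copy the first n_records FASTQ records (4 lines per record)
    from one gzipped FASTQ file to another."""
    with gzip.open(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        for line in islice(fin, n_records * 4):
            fout.write(line)
```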
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine):

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
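That rule of thumb is easy to turn into a back-of-the-envelope estimate (the helper is ours, and the 1 TB per 200 million reads figure is only a rough guide):

```python
def intermediate_disk_tb(n_reads):
    """Rough intermediate-disk estimate, using the rule of thumb of
    ~1 TB per 200 million reads."""
    return n_reads / 200e6
```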
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has 40 GB of disk space.

To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
labibi Documentation, Release 1.0
CHAPTER 8
Annotation of de novo transcriptome
8.1 Identify the Gene/Transcript relationships

We can generate this file like so:

    cd /mnt/work
    ~/trinity/util/support_scripts/get_Trinity_gene_to_trans_map.pl trinity_out_dir/Trinity.fasta > Trinity.fasta.gene_trans_map

Let's have a look at the map:

    less Trinity.fasta.gene_trans_map
Components, genes, and isoforms:

• The different i's that correspond to the same g represent isoforms.
• The different g's could represent different genes (or parts of genes).
• A component (TR|c) often contains related genes (paralogs or gene fragments).

Check the Trinity forum for more details.
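As a quick sanity check, you can count genes and transcripts directly from the two-column map with standard Unix tools. This is a sketch on a toy map in the same gene<TAB>transcript format; the gene and isoform names are invented, and you would point the same commands at your real Trinity.fasta.gene_trans_map.

```shell
# Build a toy gene_trans_map: two isoforms of one gene, one isoform of another.
printf 'TR1|c0_g1\tTR1|c0_g1_i1\nTR1|c0_g1\tTR1|c0_g1_i2\nTR2|c0_g1\tTR2|c0_g1_i1\n' > toy.gene_trans_map

# Column 1 is the gene; count unique genes vs. total transcripts (lines).
cut -f 1 toy.gene_trans_map | sort -u | wc -l    # number of genes -> 2
wc -l < toy.gene_trans_map                       # number of transcripts -> 3
```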
8.2 Generate the longest-ORF peptide candidates from the Trinity Assembly

We need to install TransDecoder to do this job:

    cd
    sudo cpan URI::Escape

Note: type "yes" for all interactive questions.

    curl -L https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz > transdecoder.tar.gz
    tar xzf transdecoder.tar.gz
    mv TransDecoder-2.0.1 TransDecoder
    cd TransDecoder
    make
Now we can run TransDecoder to identify the longest-ORF peptide candidates:
    cd /mnt/work
    ~/TransDecoder/TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta

Check the TransDecoder output:

    less Trinity.fasta.transdecoder_dir/longest_orfs.pep
8.3 Capturing BLAST Homologies

Install BLAST+ (http://www.ncbi.nlm.nih.gov/books/NBK52640/):
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local BLAST databases.

1. SwissProt database: the UniProt Knowledgebase, which includes the manually annotated proteins:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_sprot.trinotate_v2.0.pep.gz
    mv uniprot_sprot.trinotate_v2.0.pep.gz uniprot_sprot.trinotate.pep.gz
    gunzip uniprot_sprot.trinotate.pep.gz
    makeblastdb -in uniprot_sprot.trinotate.pep -dbtype prot
Run BLAST to find homologies:

(a) search Trinity transcripts:

    blastx -query trinity_out_dir/Trinity.fasta -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6

(b) search TransDecoder-predicted proteins:

    blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
2. Optional: UniRef90, which provides clustered sets of protein sequences, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence:

    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/uniprot_uniref90.trinotate_v2.0.pep.gz
    mv uniprot_uniref90.trinotate_v2.0.pep.gz uniprot_uniref90.trinotate.pep.gz
    gunzip uniprot_uniref90.trinotate.pep.gz
    makeblastdb -in uniprot_uniref90.trinotate.pep -dbtype prot

Perform similar searches using UniRef90 as the target database, renaming the output files accordingly:

    blastx -query trinity_out_dir/Trinity.fasta -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastx.outfmt6
    blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 4 -max_target_seqs 1 -outfmt 6 > uniref90.blastp.outfmt6

We ran these overnight already; you can download the result files to save time:

    wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
    wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
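Because -outfmt 6 output is a plain 12-column tab-separated table (query, subject, percent identity, and so on, with the e-value in column 11), it is easy to filter with awk. A sketch on a toy file; the transcript and protein names here are made up:

```shell
# Two toy hits: one strong (e-value 1e-50), one weak (0.5).
printf 'tr1\tsp|P11111\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n' >  toy.outfmt6
printf 'tr2\tsp|P22222\t40.0\t50\t30\t1\t1\t50\t1\t50\t0.5\t30\n'      >> toy.outfmt6

# Keep hits with e-value (column 11) below 1e-5, and list the matching queries.
awk -F'\t' '$11 < 1e-5 { print $1 }' toy.outfmt6    # -> tr1
```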
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmscan to identify the protein domains:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
    gunzip Pfam-A.hmm.gz
    hmmpress Pfam-A.hmm
    hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

• signal peptides, using SignalP
• transmembrane regions, using TMHMM
• rRNA transcripts, using RNAMMER
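Once hmmscan finishes, the --domtblout table (TrinotatePFAM.out above) is a whitespace-separated file with '#' comment lines, so counting non-comment lines gives a rough count of domain hits. A sketch on a toy table with invented Pfam and protein names:

```shell
# Toy --domtblout: two hit lines plus a header comment.
printf '# target name  accession  query name ...\n'       >  toy.domtblout
printf 'DomainA PF00001.1 prot1 - 1.2e-10 50.0\n'         >> toy.domtblout
printf 'DomainB PF00002.1 prot1 - 3.0e-04 10.0\n'         >> toy.domtblout

# Count lines that are not comments, i.e. the domain hits.
grep -vc '^#' toy.domtblout    # -> 2
```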
8.5 Integration of all annotations into one database

Install Trinotate:

    cd
    curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
    tar xzf trinotate.tar.gz
    mv Trinotate-2.0.2 Trinotate
Install sqlite:

    sudo apt-get install sqlite3

We also need the DBI Perl package and the SQLite driver:

    sudo cpan DBI
    sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains UniProt (SwissProt and UniRef90)-related annotation information is available from the Trinity ftp site:

    cd /mnt/work
    wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
    gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:

1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

    ~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Load the BLAST homologies:

    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6

Optional: load the UniRef90 BLAST hits:
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: load the functional annotation features:

    ~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

    ~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments we can use to control the stringency of the annotation report:

-E <float>: maximum E-value for reporting the best BLAST hit and associated annotations

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let's look at the output. Open a new shell on your local machine and copy the report down:

    scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
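The report is tab-delimited, with '.' marking empty fields, so awk gives quick summary counts. This sketch uses a toy three-column table; in the real report the BLASTX hit is one of many columns, so check the header line for the right column number before adapting it:

```shell
# Toy report: header plus two rows; '.' means "no hit", as in Trinotate output.
printf 'gene_id\ttranscript_id\tsprot_Top_BLASTX_hit\n'  >  toy_report.tsv
printf 'g1\tg1_i1\tSOMEPROT_HUMAN\n'                     >> toy_report.tsv
printf 'g2\tg2_i1\t.\n'                                  >> toy_report.tsv

# Count transcripts with a BLASTX annotation (column 3 not '.'), skipping the header.
tail -n +2 toy_report.tsv | awk -F'\t' '$3 != "."' | wc -l    # -> 1
```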
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all of the original raw data:

    cd /mnt/data
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
    curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

    cd /mnt/work
    ln -fs /mnt/data/*.fastq.gz .
9.1 Download eXpress

Now, grab eXpress:

    cd
    curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
    tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
    cd /mnt/work
    gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
    bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed

Using the index we built, we'll align the reads from a few of our samples back to our assembly:

    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

    ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
    ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
    ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
    ~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam

This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

    head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:

    bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
    1  nemaid7tr4  269  0.000000  0  0  0.000000  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
    2  nemaid1tr1  811  508.137307  1301  45  158.338092  252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
    2  nemaid2tr1  790  487.144836  1845  356  1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
    2  nemaid3tr1  852  549.122606  1792  3  871.770849  1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
    2  nemaid4tr1  675  372.190166  1005  20  88.963433  161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
    3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
    3  nemaid63tr13  406  103.720396  360  270  271.097003  1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
    3  nemaid61tr13  447  144.526787  6  0  0.000000  0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
    4  nemaid21tr8  2075  1771.684102  2782  58  958.636395  1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
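Since results.xprs is itself a tab-separated table, you can, for example, pull out the most highly expressed transcript by sorting on est_counts. A sketch on a toy two-column file with invented values:

```shell
# Toy table: target_id and est_counts only.
printf 'target_id\test_counts\n' >  toy.xprs
printf 'nemaidA\t158.3\n'        >> toy.xprs
printf 'nemaidB\t9919.9\n'       >> toy.xprs
printf 'nemaidC\t0.0\n'          >> toy.xprs

# Skip the header, sort by column 2 (general numeric, descending), take the top hit.
tail -n +2 toy.xprs | sort -k2,2gr | head -1
```

On the real results.xprs, est_counts is column 7, so you would sort with -k7,7gr instead.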
9.4 Differential Expression

First, install R and edgeR:

    sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
    csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
    csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
    csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
    csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt

We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in the two samples with two replicates each, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:

    cd /mnt/work
    curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

    Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log CPM, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
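Because the edgeR output is a plain CSV, you can also pull out significant genes on the command line. A sketch on a toy file with the same general column layout; the gene names and values are invented:

```shell
# Toy edgeR-style CSV: gene, logFC, logCPM, PValue, FDR.
printf 'gene,logFC,logCPM,PValue,FDR\n' >  toy-edgeR.csv
printf 'geneA,2.5,5.0,1e-08,1e-06\n'    >> toy-edgeR.csv
printf 'geneB,-0.3,3.2,0.4,0.6\n'       >> toy-edgeR.csv

# Genes passing an FDR < 0.05 cutoff (column 5), skipping the header.
tail -n +2 toy-edgeR.csv | awk -F, '$5 < 0.05 { print $1 }'    # -> geneA
```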
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

    sudo apt-get install -y bowtie2

Now, create a bowtie2 index out of your transcriptome:

    cd /mnt/work
    gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
    bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

    zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

    97.18% overall alignment rate
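bowtie2 prints that summary to stderr, so you can keep the number for later by redirecting stderr to a log file (add 2> bt2.log to the real command) and grepping it out. A sketch using a canned log line:

```shell
# Pretend this came from: ... | bowtie2 -U - -x transcriptome > /dev/null 2> bt2.log
printf '100000 reads; of these:\n97.18%% overall alignment rate\n' > bt2.log

# Extract just the alignment-rate line.
grep -o '[0-9.]*% overall alignment rate' bt2.log    # -> 97.18% overall alignment rate
```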
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

    hostname: titan.bch.msu.edu
    username: rnaseqmodel
    password: QecheJa6

    URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URL lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

    mkdir ~/testdata
    cd ~/testdata

    wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

    wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz

Tada!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:

1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.

Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices – costs and yields for sequencing.

ANGUS – summer NGS course – lots of resources and materials, and a book reference.

Data Carpentry – intro to R, etc.

Software Carpentry – more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (genome-guided) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms – but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. – all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options:

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses."

6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions – the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

    screen
    cd /mnt/work
    normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

    gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
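For paired-end data, subset R1 and R2 with the same head count so that mates stay in sync. A sketch on toy files; swap in your real /mnt/data paths:

```shell
# Make toy R1/R2 files with two FASTQ records each.
for r in 1 2; do
  printf '@read1/%s\nACGT\n+\nIIII\n@read2/%s\nTTTT\n+\nIIII\n' "$r" "$r" | gzip > toy_R${r}.fastq.gz
done

# Take the first 4 lines (= 1 FASTQ record) from each, preserving the pairing.
for r in 1 2; do
  gunzip -c toy_R${r}.fastq.gz | head -4 | gzip > sample_R${r}.fq.gz
done

gunzip -c sample_R1.fq.gz | wc -l    # -> 4
```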
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine):

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.

Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
    df -k /mnt

You should see something like this:

    Filesystem     1K-blocks      Used  Available  Use%  Mounted on
    /dev/xvdb       38565344  20098736   16500940   55%  /mnt

which tells you that /mnt has about 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:

    http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
CHAPTER 8
Annotation of denovo transcriptome
81 Identify the GeneTranscript relationships
we can generate this file like so
cd mntwork~trinityutilsupport_scriptsget_Trinity_gene_to_trans_mappl trinity_out_dirTrinityfasta gt Trinityfastagene_trans_map
Letrsquos have a look on the map
less Trinityfastagene_trans_map
Components genes and isoforms
bull The different (irsquos) that correspond to the same (g) represent isoforms
bull The different (grsquos) could represent different genes (or parts of genes)
bull The component (TR|c) often contain related genes (paralogs or gene fragments)
Check the Trinityseq forum for more details
82 Generate the longest-ORF peptide candidates from the Trinity As-sembly
We need to install Transdecoder to do this job
cdsudo cpan URIEscape
Note type yes for all interactive questions
curl -L httpsgithubcomTransDecoderTransDecoderarchive201targz gt transdecodertargztar xzf transdecodertargzmv TransDecoder TransDecodercd TransDecodermake
Now we can run the Transdecoder software to identify the longest-ORF peptide
29
labibi Documentation Release 10
cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta
Check the Transdecoder output
less Trinityfastatransdecoder_dirlongest_orfspep
83 Capturing BLAST Homologies
Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases
1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins
cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot
Run blast to find homologies
(a) search Trinity transcripts
blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6
(b) search Transdecoder-predicted proteins
blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6
2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence
wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot
perform similar searches using uniref90 as the target database rename output files accordingly
blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6
I have run them overnight already. You can download these files to save time:

wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastp.outfmt6
wget https://github.com/ngs-docs/2015-may-nonmodel/blob/master/_static/uniref90.blastx.outfmt6
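The .outfmt6 files above are BLAST's 12-column tab-separated tabular format (query id, subject id, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). You can get a quick feel for such a file with plain cut/sort; a sketch on a two-line toy file (the sequence IDs and numbers here are invented, not from this workshop's data):

```shell
# Build a toy outfmt 6 file; IDs and values are invented for illustration.
printf 'comp0_c0_seq1\tsp|P12345|ABC_HUMAN\t88.2\t150\t17\t1\t1\t450\t1\t150\t1e-50\t200\n' > toy.blastx.outfmt6
printf 'comp1_c0_seq1\tsp|Q67890|DEF_MOUSE\t75.0\t100\t25\t0\t1\t300\t1\t100\t2e-30\t120\n' >> toy.blastx.outfmt6

# Column 1 is the query; count how many distinct queries got at least one hit.
cut -f1 toy.blastx.outfmt6 | sort -u | wc -l
```

On the real uniref90.blastx.outfmt6, the same pipeline tells you how many of your Trinity transcripts had a Uniref90 hit.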
8.4 Characterization of functional annotation features

1. To identify protein domains, we need to install HMMER and download the Pfam domains database:
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder_dir/longest_orfs.pep > pfam.log
2. We can predict other features, like:

- signal peptides, using signalP
- transmembrane regions, using tmHMM
- rRNA transcripts, using RNAMMER
8.5 Integration of all annotations into one database

Install Trinotate:

cd
curl -L https://github.com/Trinotate/Trinotate/archive/v2.0.2.tar.gz > trinotate.tar.gz
tar xzf trinotate.tar.gz
mv Trinotate* Trinotate
Install sqlite:
sudo apt-get install sqlite3
We also need the DBI perl package:

sudo cpan DBI
sudo cpan DBD::SQLite
Retrieve the Trinotate pre-generated resource SQLite database. A pregenerated sqlite database that contains Uniprot (swissprot and uniref90)-related annotation information is available from the Trinity ftp site:

cd /mnt/work
wget ftp://ftp.broadinstitute.org/pub/Trinity/Trinotate_v2.0_RESOURCES/Trinotate.sprot_uniref90.20150131.boilerplate.sqlite.gz -O Trinotate.sqlite.gz
gunzip Trinotate.sqlite.gz
Load transcripts and coding regions. We have three data types:
1. Transcript sequences (de novo assembled transcripts or reference transcripts)
2. Protein sequences (currently as defined by TransDecoder)
3. Gene/Transcript relationships

~/Trinotate/Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta trinity_out_dir/Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder_dir/longest_orfs.pep
Loading BLAST homologies:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Optional: load Uniref90 blast hits:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6
Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls
There are two arguments that we can use to control the accuracy of annotation:

-E <float> : maximum E-value for reporting best blast hit and associated annotations

--pfam_cutoff <string> :

1. 'DNC' : domain noise cutoff (default)
2. 'DGC' : domain gathering cutoff
3. 'DTC' : domain trusted cutoff
4. 'SNC' : sequence noise cutoff
5. 'SGC' : sequence gathering cutoff
6. 'STC' : sequence trusted cutoff
Let us see the output. Open a new shell:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:

cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now get express:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nemaid7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nemaid1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nemaid2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nemaid3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nemaid4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nemaid62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nemaid63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nemaid61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nemaid21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
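If you want to confirm that every sample produced its results file, a small shell loop over the .sam-express folders works. A sketch that fakes the directory layout with empty placeholder files, so it runs anywhere (a real eXpress run creates these folders itself):

```shell
# Mimic the eXpress output layout with placeholder files, one folder per SAM.
for d in 0Hour_ATCACG_L002_001.extract.sam-express 6Hour_CGATGT_L002_001.extract.sam-express
do
    mkdir -p "$d"
    : > "$d/results.xprs"
done

# List every results file found, one per sample.
ls -d *.sam-express | while read d; do echo "$d/results.xprs"; done
```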
9.4 Differential Expression

First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:

csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
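The downstream R analysis expects these counts files to list the same transcripts in the same order, and a quick diff of the ID columns catches mismatches early. A sketch with toy two-row files (run it somewhere other than your real working directory, since it overwrites files of the same names; the IDs and counts are invented):

```shell
# Toy stand-ins for the csvtool output: tab-separated id<TAB>est_counts.
printf 'nemaid1.tr1\t12.5\nnemaid2.tr1\t3.0\n' > 0Hour_repl1_counts.txt
printf 'nemaid1.tr1\t8.1\nnemaid2.tr1\t0.0\n' > 0Hour_repl2_counts.txt

# Compare just the transcript ID columns; silence from diff means they match.
cut -f1 0Hour_repl1_counts.txt > ids1.txt
cut -f1 0Hour_repl2_counts.txt > ids2.txt
diff ids1.txt ids2.txt && echo "IDs match"
```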
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one -- diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

- False Discovery Rate
- Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now, we run the differential expression script with:

Rscript diff_exp.R
This will produce three files, nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
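A common first question about the output CSV is "which genes have the smallest FDR?", and since it is plain CSV you can answer that with sort. A sketch on a toy file (the column layout is assumed to mirror edgeR's topTags output, and the gene names and values are invented):

```shell
# Toy CSV in the assumed shape of the edgeR output (rows are invented).
cat > toy-edgeR.csv <<'EOF'
"gene","logFC","logCPM","PValue","FDR"
"nemaid1.tr1",2.1,5.3,0.0002,0.01
"nemaid2.tr1",-0.3,7.8,0.4,0.6
"nemaid3.tr1",4.7,3.2,1e-08,1e-06
EOF

# General-numeric sort on the FDR column (field 5), keeping the header on top.
( head -1 toy-edgeR.csv ; tail -n +2 toy-edgeR.csv | sort -t, -k5,5g ) > toy-sorted.csv
head -2 toy-sorted.csv
```

The -g flag makes sort understand scientific notation like 1e-06, which a plain numeric sort would mishandle.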
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.
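The browser trick above is just the standard ftp://username:password@hostname URL form; building it explicitly from its pieces (using the example credentials from this section):

```shell
# Assemble the FTP URL from its parts; values are the examples given above.
FTPUSER=rnaseqmodel
FTPPASS=QecheJa6
FTPHOST=titan.bch.msu.edu
echo "ftp://${FTPUSER}:${FTPPASS}@${FTPHOST}/" | tee ftp_url.txt
```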
In this case, you will see a 'testdata' directory. If you click on it, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' -- for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/

This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
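A qsub script (step 5 above) is mostly just your working shell script plus a few #PBS resource lines at the top. A hypothetical skeleton follows; the walltime, node count, memory, module name, and script name are all placeholders, not values from this workshop:

```shell
# Write a minimal PBS/qsub skeleton; every resource value below is a placeholder.
cat > run_pipeline.qsub <<'EOF'
#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=16gb

cd ${PBS_O_WORKDIR}        # start in the directory qsub was invoked from
module load trimmomatic    # placeholder module name
bash my_pipeline.sh        # the script you already tested on small data
EOF
chmod +x run_pipeline.qsub
# Submit with: qsub run_pipeline.qsub
```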
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options --

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example,

- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
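The lines-to-records arithmetic (4 lines per FASTQ record, so always subset in multiples of 4) is easy to check yourself; a sketch on a tiny generated file:

```shell
# Make a toy FASTQ with 3 records; each record is exactly 4 lines
# (header, sequence, '+', quality), so the file has 12 lines.
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 > toy.fq

# Record count is the line count divided by 4.
awk 'END { print NR/4 " records" }' toy.fq
```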
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) --
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine -- you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):

df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
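The 200-million-reads-per-TB rule of thumb above turns into a quick back-of-the-envelope estimate in the shell (the read count here is a made-up example, not from this workshop's data):

```shell
# Rule of thumb from above: ~1 TB of intermediate space per 200 million reads.
READS_MILLIONS=500                            # hypothetical experiment size
TB_NEEDED=$(( (READS_MILLIONS + 199) / 200 )) # integer division, rounded up
echo "${TB_NEEDED} TB of scratch space" | tee tb_estimate.txt
```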
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
labibi Documentation Release 10
cd mntwork~TransDecoderTransDecoderLongOrfs -t trinity_out_dirTrinityfasta
Check the Transdecoder output
less Trinityfastatransdecoder_dirlongest_orfspep
83 Capturing BLAST Homologies
Install BLAST+ (httpwwwncbinlmnihgovbooksNBK52640)
sudo apt-get install -y ncbi-blast+
Get the required sequence databases and prepare local blast databases
1 SwissProt database The UniProt Knowledgebase which include the Manually annotated proteins
cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_sprottrinotate_v20pepgzmv uniprot_sprottrinotate_v20pepgz uniprot_sprottrinotatepepgzgunzip uniprot_sprottrinotatepepgzmakeblastdb -in uniprot_sprottrinotatepep -dbtype prot
Run blast to find homologies
(a) search Trinity transcripts
blastx -query trinity_out_dirTrinityfasta -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastxoutfmt6
(b) search Transdecoder-predicted proteins
blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_sprottrinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt blastpoutfmt6
2 Optional Uniref90 which provides clustered sets of protein sequences in a way such that each cluster is com-posed of sequences that have at least 90 sequence identity to and 80 overlap with the longest sequence
wget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESuniprot_uniref90trinotate_v20pepgzmv uniprot_uniref90trinotate_v20pepgz uniprot_uniref90trinotatepepgzgunzip uniprot_uniref90trinotatepepgzmakeblastdb -in uniprot_uniref90trinotatepep -dbtype prot
perform similar searches using uniref90 as the target database rename output files accordingly
blastx -query trinity_out_dirTrinityfasta -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastxoutfmt6blastp -query Trinityfastatransdecoder_dirlongest_orfspep -db uniprot_uniref90trinotatepep -num_threads 4 -max_target_seqs 1 -outfmt 6 gt uniref90blastpoutfmt6
I have ran them overnight already You can download these files to save time
wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastpoutfmt6wget httpsgithubcomngs-docs2015-may-nonmodelblobmaster_staticuniref90blastxoutfmt6
84 Characterization of functional annotation features
1 identify protein domains we need to install HMMER and download the Pfam domains database
30 Chapter 8 Annotation of denovo transcriptome
labibi Documentation Release 10
sudo apt-get install -y hmmer
Then we can run hmmer to identify the protein domains
cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESPfam-Ahmmgzgunzip Pfam-Ahmmgzhmmpress Pfam-Ahmmhmmscan --cpu 4 --domtblout TrinotatePFAMout Pfam-Ahmm Trinityfastatransdecoder_dirlongest_orfspep gt pfamlog
2 We can predict other features like
bull signal peptides using signalP
bull transmembrane regions using tmHMM
bull rRNA transcripts using RNAMMER
85 Integration of all annotations into one database
install Trinotate
cdcurl -L httpsgithubcomTrinotateTrinotatearchivev202targz gt trinotatetargztar xzf trinotatetargzmv Trinotate Trinotate
install sqlite
sudo apt-get install sqlite3
We need also the DBI perl package
sudo cpan DBIsudo cpan DBDSQLite
Retrieve the Trinotate Pre-generated Resource SQLite database A pregenerated sqlite database that contains Uniprot(swissprot and uniref90)-related annotation information is available from the Trinity ftp site
cd mntworkwget ftpftpbroadinstituteorgpubTrinityTrinotate_v20_RESOURCESTrinotatesprot_uniref9020150131boilerplatesqlitegz -O Trinotatesqlitegzgunzip Trinotatesqlitegz
Load transcripts and coding regions We have three data types
1 Transcript sequences (de novo assembled transcripts or reference transcripts)
2 Protein sequences (currently as defined by TransDecoder)
3 GeneTranscript relationships
~TrinotateTrinotate Trinotatesqlite init --gene_trans_map Trinityfastagene_trans_map --transcript_fasta trinity_out_dirTrinityfasta --transdecoder_pep Trinityfastatransdecoder_dirlongest_orfspep
Loading BLAST homologies
~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastp blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_swissprot_blastx blastxoutfmt6
Optional load Uniref90 blast hits
85 Integration of all annotations into one database 31
labibi Documentation Release 10
~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastp uniref90blastpoutfmt6~TrinotateTrinotate Trinotatesqlite LOAD_trembl_blastx uniref90blastxoutfmt6
Optional Loading functional annotation features
~TrinotateTrinotate Trinotatesqlite LOAD_pfam TrinotatePFAMout
86 Output an Annotation Report
~TrinotateTrinotate Trinotatesqlite report -E 00001 gt trinotate_annotation_reportxls
There are 2 arguments that we can use to control the accuracy of annotation
-E ltfloatgt maximum E-value for reporting best blast hit and associated annotations
ndashpfam_cutoff ltstringgt
1 lsquoDNCrsquo domain noise cutoff (default)
2 lsquoDGCrsquo domain gathering cutoff
3 lsquoDTCrsquo domain trusted cutoff
4 lsquoSNCrsquo sequence noise cutoff
5 lsquoSGCrsquo sequence gathering cutoff
6 lsquoSTCrsquo sequence trusted cutoff
let us see the output Open a new shell
scp -i YOUR_SECURITY_KEYpem ubuntuYOUR_AMAZONE_INSTANCE_ADDRESSmntworktrinotate_annotation_reportxls
32 Chapter 8 Annotation of denovo transcriptome
CHAPTER 9
Quantification and Differential Expression
First make sure yoursquove downloaded all the original raw data
cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L 
httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz
and link it in
cd mntworkln -fs mntdatafastqgz
91 Download Express
Now get express
cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz
92 Align Reads with Bowtie
Next build an index file for your assembly
33
labibi Documentation Release 10
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed
Using the index we built wersquoll align the reads from a few of our samples back to our assembly
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
You should see something like this:

bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
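Because results.xprs is plain tab-separated text, standard command-line tools work on it directly. For example, to sum the est_counts column (column 7 in the header above) across all transcripts, a sketch run here on a tiny two-row stand-in for the real file:

```shell
# Sum est_counts (column 7) of an eXpress results.xprs file, skipping the header.
# Two-row stand-in for the real file, for illustration only:
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n' > results.xprs
printf '1\tnema.id1.tr1\t811\t508.1\t1301\t45\t158.3\n' >> results.xprs
printf '2\tnema.id2.tr1\t790\t487.1\t1845\t356\t1218.9\n' >> results.xprs

total=$(awk -F'\t' 'NR > 1 { sum += $7 } END { print sum }' results.xprs)
echo "total est_counts: $total"
```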
9.4 Differential Expression

First, install R and edgeR:
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
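If csvtool is not available, awk can do an equivalent extraction, since results.xprs is plain tab-separated text: keep target_id (column 2) and est_counts (column 7), and drop the header row. A hedged sketch, demonstrated on a tiny stand-in file:

```shell
# awk equivalent of the csvtool extraction: print columns 2 and 7,
# skipping the header line (NR > 1).
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n1\tnema.id1.tr1\t811\t508.1\t1301\t45\t158.3\n' > demo.xprs
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' demo.xprs > demo_counts.txt
cat demo_counts.txt
```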
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

- False Discovery Rate
- Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now, run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It contains columns for gene name, log fold change, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
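To pull the most significant genes out of a CSV like that from the command line, you can sort numerically on the FDR column while keeping the header in place. A sketch on a tiny stand-in nema-edgeR.csv (the real file's exact column order may differ; the header names here are assumptions):

```shell
# Sort a results CSV by its FDR column (assumed to be column 4),
# keeping the header line first.
printf 'gene,logFC,PValue,FDR\n' > nema-edgeR.csv
printf 'geneA,2.1,0.20,0.30\n' >> nema-edgeR.csv
printf 'geneB,-1.3,0.001,0.004\n' >> nema-edgeR.csv

{ head -1 nema-edgeR.csv; tail -n +2 nema-edgeR.csv | sort -t, -k4,4n; } > by_fdr.csv
head -2 by_fdr.csv
```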
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:

sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:

cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:

zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null

You should get something like:

97.18% overall alignment rate
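bowtie2 prints that summary on stderr, so if you want just the number, you can capture stderr to a log and grep the rate out. A sketch using a stand-in log line in place of a real bowtie2 run:

```shell
# bowtie2 writes its mapping summary to stderr; in a real run you would
# capture it with:  ... | bowtie2 -U - -x transcriptome > /dev/null 2> bowtie2.log
# Stand-in log line for illustration:
printf '97.18%% overall alignment rate\n' > bowtie2.log

rate=$(awk '/overall alignment rate/ { print $1 }' bowtie2.log)
echo "$rate"
```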
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples

Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me ;)

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data

If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URI. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, the URI lets you combine your username and password to open that link.

In this case, you will see a 'testdata' directory. If you click on it, you'll see a bunch of fastq.gz files. These are the files that you want to get onto the HPC.

To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:
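The URI is just the three pieces from the e-mail glued together as ftp://username:password@hostname. A sketch of the assembly:

```shell
# Combine the pieces from the e-mail into a single FTP URI.
hostname=titan.bch.msu.edu
username=rnaseqmodel
password=QecheJa6

uri="ftp://${username}:${password}@${hostname}"
echo "$uri"
```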
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Ta-da!
11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
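For the "write a script" step, a minimal skeleton that fails fast and logs every command is a good starting point. A hedged sketch with stand-in filenames (the real commands would replace the toy steps below):

```shell
#!/bin/sh
# Skeleton analysis script: stop at the first failing command and echo
# each command before it runs, so the log shows exactly what happened.
set -e
set -x

workdir=$(mktemp -d)    # stand-in for your real working directory
cd "$workdir"

# 1. Make a small test subset (stand-in data here):
printf 'read1\nread2\nread3\nread4\n' > reads.txt
head -2 reads.txt > subset.txt

# 2. Run the real analysis commands on the subset, one per line...
wc -l < subset.txt
```

Once this works on the subset, the same script (minus the subsetting) becomes your full-data qsub script.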
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources

UT (Austin) Sequencing Core prices - costs and yields for sequencing.

ANGUS - summer NGS course - lots of resources and materials, and book reference.

Data Carpentry - intro to R, etc.

Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.

5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses".

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.

8. How do I choose the set of tools to use?

Our recommendations, in order:

(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".

For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

- close your terminal window;
- open up a new one and connect into your Amazon machine;
- type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
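The arithmetic works because each FASTQ record is exactly four lines (header, sequence, '+', qualities), so 400,000 lines is 100,000 records; you can check a subset the same way. A sketch on generated stand-in data (uncompressed here; the command above just adds gzip on both ends):

```shell
# Generate a stand-in FASTQ with 10 records, then subset the first 3.
i=1
while [ $i -le 10 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done > sample.fq

head -12 sample.fq > subset.fq               # 12 lines = 3 records
records=$(( $(wc -l < subset.fq) / 4 ))      # 4 lines per FASTQ record
echo "$records records"
```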
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –

1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt

You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt
which tells you that /mnt has about 40 GB of disk space.
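The 1K-blocks column is in units of 1024 bytes, so the size in GB is blocks × 1024 / 10^9; for the figure above that comes to about 39.5 GB, which is where the "40 GB" estimate comes from. A sketch of the conversion:

```shell
# Convert df's 1K-blocks figure to gigabytes: blocks * 1024 bytes / 1e9.
blocks=38565344
gb=$(awk -v b="$blocks" 'BEGIN { printf "%.1f", b * 1024 / 1e9 }')
echo "$gb GB"
```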
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
labibi Documentation Release 10
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastp uniref90.blastp.outfmt6
~/Trinotate/Trinotate Trinotate.sqlite LOAD_trembl_blastx uniref90.blastx.outfmt6

Optional: loading functional annotation features:

~/Trinotate/Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out
8.6 Output an Annotation Report

~/Trinotate/Trinotate Trinotate.sqlite report -E 0.0001 > trinotate_annotation_report.xls

There are two arguments that we can use to control the accuracy of annotation:

-E <float>: maximum E-value for reporting the best BLAST hit and associated annotations.

--pfam_cutoff <string>:

1. 'DNC': domain noise cutoff (default)
2. 'DGC': domain gathering cutoff
3. 'DTC': domain trusted cutoff
4. 'SNC': sequence noise cutoff
5. 'SGC': sequence gathering cutoff
6. 'STC': sequence trusted cutoff

Let us see the output. Open a new shell on your local computer and run:

scp -i YOUR_SECURITY_KEY.pem ubuntu@YOUR_AMAZON_INSTANCE_ADDRESS:/mnt/work/trinotate_annotation_report.xls .
CHAPTER 9
Quantification and Differential Expression
First, make sure you've downloaded all the original raw data:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/0Hour_ATCACG_L002_R2_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R1_005.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_001.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_002.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_003.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_004.extract.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-non-2015-05-04/6Hour_CGATGT_L002_R2_005.extract.fastq.gz
and link it in:

cd /mnt/work
ln -fs /mnt/data/*.fastq.gz .
9.1 Download Express

Now, get eXpress:

cd
curl -L http://bio.math.berkeley.edu/eXpress/downloads/express-1.5.1/express-1.5.1-linux_x86_64.tgz > express.tar.gz
tar xzf express.tar.gz
9.2 Align Reads with Bowtie

Next, build an index file for your assembly:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie-build --offrate 1 trinity-nematostella-raw.renamed.fasta trinity-nematostella-raw.renamed
Using the index we built, we'll align the reads from a few of our samples back to our assembly:
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
9.3 Quantify Expression using eXpress

Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression levels (see the website for additional details and relevant publications).
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:

head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs

You should see something like this:
bundle_id  target_id  length  eff_length  tot_counts  uniq_counts  est_counts  eff_counts  ambig_distr_alpha  ambig_distr_beta  fpkm  fpkm_conf_low  fpkm_conf_high  solvable  tpm
1  nemaid7tr4    269   0.000000     0     0     0.000000     0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  F  0.000000e+00
2  nemaid1tr1    811   508.137307   1301  45    158.338092   252.711602  4.777128e+01  4.816246e+02  3.073997e+03  2.311142e+03  3.836852e+03  T  4.695471e+03
2  nemaid2tr1    790   487.144836   1845  356   1218.927626  1976.727972  1.111471e+02  8.063959e+01  2.468419e+04  2.254229e+04  2.682610e+04  T  3.770463e+04
2  nemaid3tr1    852   549.122606   1792  3     871.770849   1352.610064  5.493335e+01  5.818711e+01  1.566146e+04  1.375746e+04  1.756546e+04  T  2.392257e+04
2  nemaid4tr1    675   372.190166   1005  20    88.963433    161.343106  2.836182e+01  3.767281e+02  2.358011e+03  1.546107e+03  3.169914e+03  T  3.601816e+03
3  nemaid62tr13  2150  1846.657210  9921  9825  9919.902997  11549.404689  1.704940e+03  1.970774e+01  5.299321e+04  5.281041e+04  5.317602e+04  T  8.094611e+04
3  nemaid63tr13  406   103.720396   360   270   271.097003   1061.173959  1.934732e+02  1.567940e+04  2.578456e+04  2.417706e+04  2.739205e+04  T  3.938541e+04
3  nemaid61tr13  447   144.526787   6     0     0.000000     0.000000  2.246567e+04  2.246565e+10  3.518941e-08  0.000000e+00  1.296989e-03  T  5.375114e-08
4  nemaid21tr8   2075  1771.684102  2782  58    958.636395   1122.756883  1.223148e+02  2.476298e+02  5.337855e+03  4.749180e+03  5.926529e+03  T  8.153470e+03
9.4 Differential Expression

First, install R and edgeR:

sudo apt-get install -y r-base-core r-bioc-edger csvtool

Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
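If csvtool is unfamiliar: each command keeps just the target_id and est_counts columns, drops the header line, and writes a tab-separated counts file. The same transformation can be sketched with plain awk on a tiny stand-in file (illustration only; the tutorial itself uses csvtool):

```shell
# Build a tiny two-row stand-in for a results.xprs file (tab-separated).
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\n2\tnemaid1tr1\t811\t508.137307\t1301\t45\t158.338092\n' > mini.xprs

# Keep columns 2 (target_id) and 7 (est_counts), skipping the header --
# the equivalent of `csvtool namedcol target_id,est_counts ... | csvtool drop 1`.
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' mini.xprs
```

This prints one `target_id<TAB>est_counts` line per transcript, which is exactly the two-column format the edgeR script expects.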
We'll be using edgeR to do the basic differential expression analysis of our counts.

To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one: diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R

Now, run the differential expression script with:

Rscript diff_exp.R
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, log counts per million, P-value, and FDR-adjusted P-value.

You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
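The "overall alignment rate" is simply the fraction of input reads that mapped, expressed as a percentage. Since the command above feeds in 400,000 lines = 100,000 single-end reads, a 97.18% rate corresponds to roughly 97,180 mapped reads (ignoring any rounding in bowtie2's own report). The arithmetic can be checked directly:

```shell
# 97,180 of 100,000 reads aligned -> overall alignment rate in percent.
awk 'BEGIN { printf "%.2f%%\n", 100 * 97180 / 100000 }'
# -> 97.18%
```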
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating number of samples. Their numbers are surprisingly large to me ;).

In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:

hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu

you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
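The pattern here is `ftp://username:password@hostname/` - the credentials are embedded directly in the URL. With the example values above:

```shell
# Embed FTP credentials directly in the URL (ftp://user:password@host/path/).
host=titan.bch.msu.edu
user=rnaseqmodel
pass=QecheJa6
echo "ftp://${user}:${pass}@${host}/testdata/"
# -> ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
```

The same URL works in a browser, in wget, or in curl.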
In this case, you will see a 'testdata' directory. If you click on it, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:

mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:

wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
Tada!

11.3 Developing your own pipeline

Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
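The "write a script" step can be sketched as a minimal shell skeleton. Everything below is a placeholder (the input filename, work directory, and step names are made up); the point is the structure: fail on the first error, fail on unset variables, and echo each step so the log tells you where a run died:

```shell
# Minimal, illustrative pipeline skeleton. All names are placeholders.
cat > pipeline.sh <<'EOF'
#!/bin/bash
set -e          # stop at the first failing command
set -u          # treat unset variables as errors

INPUT=sample.fq.gz       # placeholder: start with a small data subset!
WORKDIR=pipeline_test

mkdir -p "$WORKDIR"
cd "$WORKDIR"

echo "step 1: quality trim $INPUT"
# (trimming command goes here)

echo "step 2: assemble"
# (assembly command goes here)

echo "done"
EOF

bash pipeline.sh
```

Because each step echoes before it runs, a failed job's output shows exactly which stage to debug; swapping the echoes for real commands turns this into the qsub script of steps 5-6.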
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?

This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).

The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.

The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.

The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?

Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.

So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?

The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
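As a toy illustration of position-based trimming, suppose quality collapses after base 8 of a (made-up) 10 bp read: every sequence and quality line can be hard-cropped at that position. In practice you would use a real trimmer such as Trimmomatic, as earlier in this tutorial; this awk one-liner only shows the idea:

```shell
# A fake 10 bp FASTQ record whose quality crashes at the last two bases.
printf '@read1\nACGTACGTAA\n+\nIIIIIIII##\n' > one_record.fq

# Hard-crop sequence (line 2 of 4) and quality (line 4 of 4) to 8 bases;
# the header (@...) and separator (+) lines pass through unchanged.
awk 'NR % 4 == 2 || NR % 4 == 0 { $0 = substr($0, 1, 8) } { print }' one_record.fq
```

Note this crops every read at the same position; quality-aware trimmers instead decide per read.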
4. How do we transfer our data to Amazon (or any remote computer)?

There are two options –

If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).

If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?

See Tips and Tricks for working with Remote Computers, "Running full analyses."

6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.

7. How do we know if our reference transcriptome is "good enough"?

See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?

Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time

Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:

screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:

• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.

(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files

To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data

If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
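The 4:1 ratio comes from the FASTQ format itself: each read is exactly four lines (header, sequence, '+', qualities), so any line count you pass to head must be a multiple of four to avoid a truncated record. In miniature, with made-up reads:

```shell
# Generate 10 fake FASTQ records (4 lines each = 40 lines)...
for i in $(seq 1 10); do printf '@read%d\nACGT\n+\nIIII\n' "$i"; done > toy.fq

# ...then keep the first 12 lines, i.e. exactly 3 complete records.
head -12 toy.fq > subset.fq
echo "$(( $(wc -l < subset.fq) / 4 )) records"
# -> 3 records
```

The same arithmetic gives 400,000 / 4 = 100,000 records in sample.fq.gz.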
14.4 Running full analyses on Amazon Web Services

You need to do three things to run a full analysis on AWS (or really any cloud machine) –

1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
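The rule of thumb scales linearly, so you can estimate disk needs up front. For a hypothetical 500-million-read project (the read count here is made up for illustration):

```shell
# Rule of thumb: ~1 TB of intermediate disk per 200 million reads.
awk 'BEGIN { reads = 500e6; printf "%.1f TB\n", reads / 200e6 }'
# -> 2.5 TB
```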
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.

To let things run for a long time, you basically need to run them in screen (see above, "Use screen").

By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
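The 1K-blocks column is in kilobytes, so dividing the block count by 10^6 gives the size in decimal gigabytes. For the 38,565,344 blocks shown above, that works out to just under the volume's nominal 40 GB:

```shell
# df -k reports 1 KB blocks; divide by 1e6 for decimal gigabytes.
awk 'BEGIN { printf "%.1f GB\n", 38565344 / 1e6 }'
# -> 38.6 GB
```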
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 9
Quantification and Differential Expression
First make sure yoursquove downloaded all the original raw data
cd mntdatacurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-040Hour_ATCACG_L002_R2_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_004extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R1_005extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_001extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_002extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_003extractfastqgzcurl -O -L httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_004extractfastqgzcurl -O -L 
httpdib-trainingucdavisedus3amazonawscommRNAseq-non-2015-05-046Hour_CGATGT_L002_R2_005extractfastqgz
and link it in
cd mntworkln -fs mntdatafastqgz
91 Download Express
Now get express
cdcurl -L httpbiomathberkeleyedueXpressdownloadsexpress-151express-151-linux_x86_64tgz gt expresstargztar xzf expresstargz
92 Align Reads with Bowtie
Next build an index file for your assembly
33
labibi Documentation Release 10
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed
Using the index we built wersquoll align the reads from a few of our samples back to our assembly
```
bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_001.extract.fastq.gz) > 0Hour_ATCACG_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 0Hour_ATCACG_L002_R1_002.extract.fastq.gz) -2 <(zcat 0Hour_ATCACG_L002_R2_002.extract.fastq.gz) > 0Hour_ATCACG_L002_002.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_001.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_001.extract.fastq.gz) > 6Hour_CGATGT_L002_001.extract.sam

bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed -1 <(zcat 6Hour_CGATGT_L002_R1_002.extract.fastq.gz) -2 <(zcat 6Hour_CGATGT_L002_R2_002.extract.fastq.gz) > 6Hour_CGATGT_L002_002.extract.sam
```
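The four commands above differ only in filenames, so a loop that derives the R2 and output names from each R1 file does the same work. A sketch, assuming the naming scheme used above and the `trinity-nematostella-raw.renamed` index:

```shell
# Align every R1/R2 pair; the mate and output names are derived from R1.
for r1 in *_R1_*.extract.fastq.gz
do
    r2=${r1/_R1_/_R2_}           # paired R2 file
    sam=${r1/_R1_/_}             # drop the _R1_ tag...
    sam=${sam%.fastq.gz}.sam     # ...and swap the extension for .sam
    bowtie -aS -X 800 --offrate 1 trinity-nematostella-raw.renamed \
        -1 <(zcat "$r1") -2 <(zcat "$r2") > "$sam"
done
```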
9.3 Quantify Expression using eXpress
Finally, using eXpress, we'll get abundance estimates for our transcripts. eXpress uses a probabilistic model to efficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details and relevant publications).
```
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 0Hour_ATCACG_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 0Hour_ATCACG_L002_002.extract.sam

~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_001.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_001.extract.sam
~/express-1.5.1-linux_x86_64/express --no-bias-correct -o 6Hour_CGATGT_L002_002.extract.sam-express trinity-nematostella-raw.renamed.fasta 6Hour_CGATGT_L002_002.extract.sam
```
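As with the alignments, the four eXpress runs can be collapsed into a loop over the .sam files (same flags and paths as above):

```shell
# Run eXpress on each alignment file, writing per-sample output folders.
for sam in {0Hour_ATCACG,6Hour_CGATGT}_L002_00{1,2}.extract.sam
do
    ~/express-1.5.1-linux_x86_64/express --no-bias-correct \
        -o "$sam-express" trinity-nematostella-raw.renamed.fasta "$sam"
done
```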
This will put the results in a new set of folders named like <condition>_<barcode>_L002_<replicate>.extract.sam-express. Each contains a file called results.xprs with the results. We'll look at the first ten lines of one of the files using the head command:
```
head 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs
```
You should see something like this:
```
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 nema.id7.tr4 269 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 nema.id1.tr1 811 508.137307 1301 45 158.338092 252.711602 4.777128e+01 4.816246e+02 3.073997e+03 2.311142e+03 3.836852e+03 T 4.695471e+03
2 nema.id2.tr1 790 487.144836 1845 356 1218.927626 1976.727972 1.111471e+02 8.063959e+01 2.468419e+04 2.254229e+04 2.682610e+04 T 3.770463e+04
2 nema.id3.tr1 852 549.122606 1792 3 871.770849 1352.610064 5.493335e+01 5.818711e+01 1.566146e+04 1.375746e+04 1.756546e+04 T 2.392257e+04
2 nema.id4.tr1 675 372.190166 1005 20 88.963433 161.343106 2.836182e+01 3.767281e+02 2.358011e+03 1.546107e+03 3.169914e+03 T 3.601816e+03
3 nema.id62.tr13 2150 1846.657210 9921 9825 9919.902997 11549.404689 1.704940e+03 1.970774e+01 5.299321e+04 5.281041e+04 5.317602e+04 T 8.094611e+04
3 nema.id63.tr13 406 103.720396 360 270 271.097003 1061.173959 1.934732e+02 1.567940e+04 2.578456e+04 2.417706e+04 2.739205e+04 T 3.938541e+04
3 nema.id61.tr13 447 144.526787 6 0 0.000000 0.000000 2.246567e+04 2.246565e+10 3.518941e-08 0.000000e+00 1.296989e-03 T 5.375114e-08
4 nema.id21.tr8 2075 1771.684102 2782 58 958.636395 1122.756883 1.223148e+02 2.476298e+02 5.337855e+03 4.749180e+03 5.926529e+03 T 8.153470e+03
```
9.4 Differential Expression
First, install R and edgeR:
```
sudo apt-get install -y r-base-core r-bioc-edger csvtool
```
Now, we extract the columns we need from the eXpress outputs and convert them to the appropriate format:
```
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
```
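If csvtool isn't installed, plain awk can do the same extraction: in eXpress's results.xprs, target_id is the second tab-separated column and est_counts the seventh, so (one sample shown):

```shell
# awk equivalent of one csvtool line: keep target_id (column 2) and
# est_counts (column 7) from the tab-separated file, dropping the header.
awk -F'\t' 'NR > 1 { print $2 "\t" $7 }' \
    0Hour_ATCACG_L002_001.extract.sam-express/results.xprs \
    > 0Hour_repl1_counts.txt
```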
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:

• False Discovery Rate
• Learn R with Swirl
So, download the script:
```
cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
```
Now we run the differential expression script with:
```
Rscript diff_exp.R
```
This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. It consists of five columns: gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different dataset: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
```
sudo apt-get install -y bowtie2
```
Now, create a bowtie2 index out of your transcriptome:
```
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
```
And then, finally, count the number of reads that map to your transcriptome:
```
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
```
You should get something like:
```
97.18% overall alignment rate
```
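bowtie2 prints that summary on stderr, so if you want to capture the rate programmatically (say, to compare several candidate assemblies), you can redirect and parse it. A sketch, assuming the `transcriptome` index built above:

```shell
# Capture bowtie2's summary (printed on stderr) and pull out the percentage.
summary=$(zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 \
    | bowtie2 -U - -x transcriptome 2>&1 >/dev/null)
rate=$(printf '%s\n' "$summary" | awk '/overall alignment rate/ { print $1 }')
echo "mapping rate: $rate"
```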
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
```
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URI: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
```
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:
```
mkdir ~/testdata
cd ~/testdata
```
```
wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
```
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:
```
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
```
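If your sequencing core also provides a checksum list, it's worth verifying the downloads before starting analysis. The filename `md5sum.txt` below is hypothetical — substitute whatever your core actually ships:

```shell
# Verify every downloaded file against the core's checksum list.
# 'md5sum.txt' is an assumed filename -- ask your core for the real one.
cd ~/testdata
md5sum -c md5sum.txt
```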
Tada
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.

2. Generate a data subset (the first few 100k reads, for example).

3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.

4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.

5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).

6. Make sure the qsub script works on your same small data.

7. Scale up to a big test data set.

8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).

9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
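For the qsub step, a skeleton might look like the following — purely illustrative: the resource request, module names, and filenames all depend on your cluster and your data:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=4,mem=16gb,walltime=24:00:00
#PBS -N rnaseq-pipeline

set -e                       # stop at the first failing command
cd "$PBS_O_WORKDIR"          # qsub starts jobs in $HOME; return to submit dir

module load FastQC Trimmomatic   # hypothetical module names -- check yours

# The small-data commands you already debugged interactively go here, e.g.:
fastqc subset_R1.fastq.gz subset_R2.fastq.gz   # hypothetical subset files
```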
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options –
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.

(b) Look at tools and workflows that are used in published papers by groups working in your area.

(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often, you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
```
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
```
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
```
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
```
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
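Since a FASTQ record is always four lines, the recipe generalizes to a small function parameterized by record count (a sketch; the function name is ours, not a standard tool):

```shell
# Take the first N FASTQ records from a gzipped file (4 lines per record).
subset_fastq() {
    local infile=$1 n_records=$2 outfile=$3
    gunzip -c "$infile" | head -n $(( n_records * 4 )) | gzip > "$outfile"
}

# e.g. the same 100,000-record subset as above:
subset_fastq /mnt/data/SRR534005_1.fastq.gz 100000 sample.fq.gz
```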
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really, any cloud machine) –
1. you need to get your data onto that machine;

2. you need to be prepared to let things run for a long time;

3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
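That rule of thumb turns into quick shell arithmetic; here with an assumed scenario of six samples at 100 million reads each:

```shell
# Rule of thumb from above: ~1 TB of intermediate disk per 200 million reads.
reads_per_sample=100000000   # assumed: 100m reads per sample
samples=6
total_reads=$(( reads_per_sample * samples ))
tb_needed=$(( (total_reads + 199999999) / 200000000 ))   # round up
echo "plan for about ${tb_needed} TB of scratch space"
# prints: plan for about 3 TB of scratch space
```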
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
```
df -k /mnt
```
You should see something like this:
```
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/xvdb       38565344  20098736  16500940  55% /mnt
```
which tells you that /mnt has about 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
labibi Documentation Release 10
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie-build --offrate 1 trinity-nematostella-rawrenamedfasta trinity-nematostella-rawrenamed
Using the index we built wersquoll align the reads from a few of our samples back to our assembly
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_001extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_001extractfastqgz) gt 0Hour_ATCACG_L002_001extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 0Hour_ATCACG_L002_R1_002extractfastqgz) -2 lt(zcat 0Hour_ATCACG_L002_R2_002extractfastqgz) gt 0Hour_ATCACG_L002_002extractsam
bowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_001extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_001extractfastqgz) gt 6Hour_CGATGT_L002_001extractsambowtie -aS -X 800 --offrate 1 trinity-nematostella-rawrenamed -1 lt(zcat 6Hour_CGATGT_L002_R1_002extractfastqgz) -2 lt(zcat 6Hour_CGATGT_L002_R2_002extractfastqgz) gt 6Hour_CGATGT_L002_002extractsam
93 Quantify Expression using eXpress
Finally using eXpress wersquoll get abundance estimates for our transcripts eXpress uses a probabilistic model to ef-ficiently assign mapped reads to isoforms and estimate expression level (see the website for additional details andrelevant publications)
~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 0Hour_ATCACG_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 0Hour_ATCACG_L002_002extractsam
~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_001extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_001extractsam~express-151-linux_x86_64express --no-bias-correct -o 6Hour_CGATGT_L002_002extractsam-express trinity-nematostella-rawrenamedfasta 6Hour_CGATGT_L002_002extractsam
This will put the results in a new set of folders named like ltconditiongt_ltbarcodegt_L002_ltreplicategtextractsam-express Each contains a file called resultsxprs with the results Wersquoll look at the first ten lines of one of the files usingthe head command
head 0Hour_ATCACG_L002_001extractsam-expressresultsxprs
You should see something like this
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm1 nemaid7tr4 269 0000000 0 0 0000000 0000000 0000000e+00 0000000e+00 0000000e+00 0000000e+00 0000000e+00 F 0000000e+002 nemaid1tr1 811 508137307 1301 45 158338092 252711602 4777128e+01 4816246e+02 3073997e+03 2311142e+03 3836852e+03 T 4695471e+032 nemaid2tr1 790 487144836 1845 356 1218927626 1976727972 1111471e+02 8063959e+01 2468419e+04 2254229e+04 2682610e+04 T 3770463e+042 nemaid3tr1 852 549122606 1792 3 871770849 1352610064 5493335e+01 5818711e+01 1566146e+04 1375746e+04 1756546e+04 T 2392257e+042 nemaid4tr1 675 372190166 1005 20 88963433 161343106 2836182e+01 3767281e+02 2358011e+03 1546107e+03 3169914e+03 T 3601816e+033 nemaid62tr13 2150 1846657210 9921 9825 9919902997 11549404689 1704940e+03 1970774e+01 5299321e+04 5281041e+04 5317602e+04 T 8094611e+043 nemaid63tr13 406 103720396 360 270 271097003 1061173959 1934732e+02 1567940e+04 2578456e+04 2417706e+04 2739205e+04 T 3938541e+043 nemaid61tr13 447 144526787 6 0 0000000 0000000 2246567e+04 2246565e+10 3518941e-08 0000000e+00 1296989e-03 T 5375114e-084 nemaid21tr8 2075 1771684102 2782 58 958636395 1122756883 1223148e+02 2476298e+02 5337855e+03 4749180e+03 5926529e+03 T 8153470e+03
94 Differential Expression
First install R and edgeR
sudo apt-get install -y r-base-core r-bioc-edger csvtool
Now we extract the columns we need from the eXpress outputs and convert it to the appropriate format
34 Chapter 9 Quantification and Differential Expression
labibi Documentation Release 10
csvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 0Hour_ATCACG_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 0Hour_repl2_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_001extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl1_countstxtcsvtool namedcol -t TAB target_idest_counts 6Hour_CGATGT_L002_002extractsam-expressresultsxprs | csvtool drop 1 -u TAB - gt 6Hour_repl2_countstxt
Wersquoll be using edgeR to do the basic differential expression analysis of our counts
To run edgeR you need to write a data loading and manipulation script in R In this case Irsquove provided one ndashdiff_expR This script will load in two samples with two replicates execute an MA plot do an MDS analysisplotand provide a spreadsheet with differential expression information in it
Links
bull False Discovery Rate
bull Learn R with Swirl
So download the script
cd mntworkcurl -O http2015-may-nonmodelreadthedocsorgenlatest_staticdiff_expR
Now we run the differential expression script with
Rscript diff_expR
This will produce three files nema-edgeR-MA-plotpdf nema-edgeR-MDSpdf and nema-edgeRcsv The CSV filecan be opened directly in Excel you can also look at it here It consists of five columns gene name log fold changeP-value and FDR-adjusted P-value
You can also view more informative versions of these files generated from a different dataset chick-edgeR-MA-plotpdf and chick-edgeR-MDSpdf
94 Differential Expression 35
labibi Documentation Release 10
36 Chapter 9 Quantification and Differential Expression
CHAPTER 10
Remapping your reads to your assembled transcriptome
First wersquoll need to make sure bowtie2 is installed
sudo apt-get install -y bowtie2
Now create a bowtie2 index out of your transcriptome
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome
And then finally count the number of reads that map to your transcriptome
zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull
You should get something like
9718 overall alignment rate
37
labibi Documentation Release 10
38 Chapter 10 Remapping your reads to your assembled transcriptome
CHAPTER 11
Miscellaneous advice
111 Sequencing depth and number of samples
Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )
In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes
112 Downloading your data
If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get
hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6
URI ftprnaseqmodelQecheJa6titanbchmsuedu
you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link
In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC
To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC
mkdir ~testdatacd ~testdata
wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata
This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do
wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz
39
labibi Documentation Release 10
Tada
113 Developing your own pipeline
Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips
1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial
2 Generate a data subset (the first few 100k reads for example)
2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such
3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works
4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)
5 Make sure the qsub script works on your same small data
6 Scale up to a big test data set
7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)
8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare
Next More resources
40 Chapter 11 Miscellaneous advice
CHAPTER 12
More resources
121 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
122 Places to share data scripts and results files
Figshare
41
labibi Documentation Release 10
42 Chapter 12 More resources
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 0Hour_ATCACG_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 0Hour_repl2_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_001.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl1_counts.txt
csvtool namedcol -t TAB target_id,est_counts 6Hour_CGATGT_L002_002.extract.sam-express/results.xprs | csvtool drop 1 -u TAB - > 6Hour_repl2_counts.txt
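Each of these lines pulls the target_id and est_counts columns out of an eXpress results.xprs file and drops the header row, leaving a plain two-column counts file. If csvtool isn't installed, the same extraction can be sketched with awk; the input built below is a tiny stand-in (a real results.xprs has many more columns, which is exactly why selecting by header name rather than position is useful):

```shell
# Tiny stand-in for an eXpress results.xprs file: tab-separated, header first.
# A real file has many more columns; only the two we extract matter here.
printf 'bundle_id\ttarget_id\test_counts\n1\tTRINITY_DN1_c0_g1_i1\t42.0\n2\tTRINITY_DN2_c0_g1_i1\t7.5\n' > results.xprs

# Select two columns by header name, then drop the header row --
# the awk equivalent of `csvtool namedcol ... | csvtool drop 1`.
awk -F'\t' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $col["target_id"] "\t" $col["est_counts"] }
' results.xprs > sample_counts.txt

cat sample_counts.txt
```

Looking the columns up by name keeps the sketch robust to however many extra columns the real eXpress output carries.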
We'll be using edgeR to do the basic differential expression analysis of our counts.
To run edgeR, you need to write a data loading and manipulation script in R. In this case, I've provided one – diff_exp.R. This script will load in two samples with two replicates, execute an MA plot, do an MDS analysis/plot, and provide a spreadsheet with differential expression information in it.
Links:
• False Discovery Rate
• Learn R with Swirl
So, download the script:

cd /mnt/work
curl -O http://2015-may-nonmodel.readthedocs.org/en/latest/_static/diff_exp.R
Now we run the differential expression script with:

Rscript diff_exp.R

This will produce three files: nema-edgeR-MA-plot.pdf, nema-edgeR-MDS.pdf, and nema-edgeR.csv. The CSV file can be opened directly in Excel; you can also look at it here. Its columns include the gene name, log fold change, P-value, and FDR-adjusted P-value.
You can also view more informative versions of these files, generated from a different data set: chick-edgeR-MA-plot.pdf and chick-edgeR-MDS.pdf.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First, we'll need to make sure bowtie2 is installed:
sudo apt-get install -y bowtie2
Now, create a bowtie2 index out of your transcriptome:
cd /mnt/work
gunzip -c trinity-nematostella-raw.renamed.fasta.gz > trinity-nematostella-raw.renamed.fasta
bowtie2-build trinity-nematostella-raw.renamed.fasta transcriptome
And then, finally, count the number of reads that map to your transcriptome:
zcat 0Hour_ATCACG_L002_R1_001.extract.fastq.gz | head -400000 | bowtie2 -U - -x transcriptome > /dev/null
You should get something like:

97.18% overall alignment rate
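The "overall alignment rate" bowtie2 prints is simply the number of reads that aligned at least once, divided by the number of reads given to it. A sketch of that arithmetic with made-up tallies (100,000 reads in, as in the head -400000 subset above; the aligned count here is hypothetical):

```shell
# Hypothetical tallies: 100,000 reads attempted, 97,180 aligned at least once.
total=100000
aligned=97180

# Reproduce bowtie2's summary line from the two counts.
awk -v a="$aligned" -v t="$total" \
    'BEGIN { printf "%.2f%% overall alignment rate\n", 100 * a / t }'
```

Higher is better; this percentage is the "good enough" check for a reference transcriptome discussed in the Miscellaneous questions chapter.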
CHAPTER 11
Miscellaneous advice
11.1 Sequencing depth and number of samples
Hart et al. (2013) provides a nice description and a set of tools for estimating your needed sequencing depth and number of samples. They provide an Excel-based calculator for calculating the number of samples. Their numbers are surprisingly large to me :)
In a proposal for an exploratory effort to discover differentially expressed genes, I would suggest 3-5 biological replicates with 30-50 million reads each. More reads is usually cheaper than more replicates, so 50-100m reads may give you more power to resolve smaller fold changes.
11.2 Downloading your data
If you do your sequencing at the MSU Core Facility, you'll get an e-mail from them when your samples are ready. The e-mail will give you an FTP site, a username, and a password, as well as a URL. You can use these to download your data. For example, if you get:
hostname: titan.bch.msu.edu
username: rnaseqmodel
password: QecheJa6

URL: ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu
you can go to ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu in your Web browser; that is, it lets you combine your username and password to open that link.
In this case, you will see a 'testdata' directory. If you click on that, you'll see a bunch of .fastq.gz files. These are the files that you want to get onto the HPC.
To download these files onto the HPC, log into the HPC, go to the directory on the HPC you want to put the files in, and run a 'wget' – for example, on the HPC:
mkdir ~/testdata
cd ~/testdata

wget -r -np -nH ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/
This will download _all_ of the files in that directory. You can also do them one at a time, e.g. to get 'Ath_Mut_1_R1.fastq.gz', you would do:
wget ftp://rnaseqmodel:QecheJa6@titan.bch.msu.edu/testdata/Ath_Mut_1_R1.fastq.gz
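The long FTP URLs above are just the three pieces from the e-mail glued together in the form ftp://username:password@hostname/path. A sketch of composing one from shell variables, using the example credentials above:

```shell
# The three pieces from the example e-mail above.
hostname=titan.bch.msu.edu
username=rnaseqmodel
password=QecheJa6

# Glue them into the ftp://user:pass@host/path form that wget understands.
url="ftp://${username}:${password}@${hostname}/testdata/"
echo "$url"

# Then fetch with it, exactly as above:
#   wget -r -np -nH "$url"
```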
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then, edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
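Tips 2-6 above boil down to: make a subset, get commands working, freeze them in a script, and rerun the script on the subset before scaling up. A minimal sketch of that shape, with coreutils standing in for the real analysis steps (all file and script names here are hypothetical):

```shell
# A toy "full" dataset, and the small subset you actually develop against.
seq 1 1000 > all_reads.txt
head -100 all_reads.txt > subset_reads.txt

# pipeline.sh: the commands you worked out interactively, frozen as a script.
cat > pipeline.sh <<'EOF'
#!/bin/sh
set -e                                      # stop at the first failing step
input=$1
wc -l < "$input" > "${input%.txt}.counts"   # stand-in for a real analysis step
EOF
chmod +x pipeline.sh

# Prove the script works on the small data before touching the full set.
./pipeline.sh subset_reads.txt
cat subset_reads.counts
```

The same script then runs unchanged on all_reads.txt once the small-data run checks out, which is the whole point of steps 4-7.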
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing.
ANGUS - summer NGS course - lots of resources and materials, and book references.
Data Carpentry - intro to R, etc.
Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc. etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
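That first FastQC graph is just the mean quality score at each read position, averaged over all reads. A toy version of the calculation on a two-read synthetic FASTQ (Phred+33 encoding); this is only the arithmetic behind the plot, not FastQC itself:

```shell
# Two tiny FASTQ records; the quality string is every 4th line.
# In Phred+33, 'I' decodes to Q40 and '#' to Q2.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII#I\n' > toy.fastq

# Mean quality per position: decode each character, average down the column.
awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
NR % 4 == 0 {
    n++
    for (i = 1; i <= length($0); i++) sum[i] += ord[substr($0, i, 1)] - 33
}
END { for (i = 1; i <= length($0); i++) printf "pos %d: Q%.1f\n", i, sum[i] / n }' toy.fastq
```

The dip at position 3 (one read had a Q2 there) is exactly the kind of signal the FastQC plot shows at scale.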
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options –
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:

gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz

This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
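Why 400,000 lines means 100,000 records: a FASTQ record is exactly four lines (header, sequence, '+', qualities), so the head count must be a multiple of 4. A miniature, runnable version of the same command on synthetic data:

```shell
# Build a tiny gzipped FASTQ: 10 records of 4 lines each (40 lines total).
for i in $(seq 1 10); do
    printf '@read%d\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip > mini.fastq.gz

# Same shape as the real command: take the first 3 records = 12 lines.
gunzip -c mini.fastq.gz | head -12 | gzip > sample.fq.gz

# Records in the subset = lines / 4.
gunzip -c sample.fq.gz | wc -l
```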
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) –
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core-specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:

Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt

which tells you that /mnt has about 40 GB of disk space.
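The sizes in that df output are 1-kilobyte blocks, so dividing by 1024 twice gives gigabytes (38,565,344 KB is about 37 GB, i.e. the "roughly 40 GB" above). A sketch of the conversion with awk on a captured copy of the output:

```shell
# The df output from above, saved as text.
cat > df.out <<'EOF'
Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt
EOF

# Convert the 1K-blocks and Available columns to gigabytes (KB -> GB).
awk 'NR == 2 { printf "total %.1f GB, free %.1f GB\n",
               $2 / 1024 / 1024, $4 / 1024 / 1024 }' df.out
```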
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
CHAPTER 10
Remapping your reads to your assembled transcriptome
First wersquoll need to make sure bowtie2 is installed
sudo apt-get install -y bowtie2
Now create a bowtie2 index out of your transcriptome
cd mntworkgunzip -c trinity-nematostella-rawrenamedfastagz gt trinity-nematostella-rawrenamedfastabowtie2-build trinity-nematostella-rawrenamedfasta transcriptome
And then finally count the number of reads that map to your transcriptome
zcat 0Hour_ATCACG_L002_R1_001extractfastqgz | head -400000 | bowtie2 -U - -x transcriptome gt devnull
You should get something like
9718 overall alignment rate
37
labibi Documentation Release 10
38 Chapter 10 Remapping your reads to your assembled transcriptome
CHAPTER 11
Miscellaneous advice
111 Sequencing depth and number of samples
Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )
In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes
112 Downloading your data
If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get
hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6
URI ftprnaseqmodelQecheJa6titanbchmsuedu
you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link
In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC
To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC
mkdir ~testdatacd ~testdata
wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata
This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do
wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz
39
labibi Documentation Release 10
Tada
113 Developing your own pipeline
Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.)
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
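As a rough sketch of the "turn it into a qsub script" step, here is approximately what a minimal submission script might look like. The resource requests, module name, and filenames are all assumptions for illustration; check your own cluster's documentation for the real ones.

```shell
# Write a hypothetical qsub script to a file. The '#PBS' lines are batch
# scheduler directives; the module and script names below are made up.
cat > run-trim.qsub <<'EOF'
#!/bin/bash
#PBS -l walltime=24:00:00,mem=8gb,nodes=1:ppn=4
cd "$PBS_O_WORKDIR"        # start in the directory you submitted from
module load Trimmomatic    # load whatever modules your commands need
sh trim-commands.sh        # the script you already tested on small data
EOF
chmod +x run-trim.qsub
```

You would then submit it with 'qsub run-trim.qsub' and check on it with 'qstat'.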
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing.
ANGUS - summer NGS course - lots of resources and materials, and book references.
Data Carpentry - intro to R, etc.
Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So, if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.
So, transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options –
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE, or iPlant, or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome.
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using, and start there.
(b) Look at tools and workflows that are used in published papers by groups working in your area.
(c) Look for good tutorials online.
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
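A quick sanity check on a subset like this: since each FASTQ record is exactly 4 lines, the subset's line count must be a multiple of 4. The sketch below builds a tiny 2-record sample.fq.gz so it is self-contained; in practice you would point it at the subset you just made.

```shell
# Make a tiny 2-record FASTQ for demonstration (stand-in for your subset).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | gzip > sample.fq.gz

# Count lines and convert to records; warn if the count is not a multiple of 4.
lines=$(gunzip -c sample.fq.gz | wc -l)
records=$((lines / 4))
if [ $((lines % 4)) -ne 0 ]; then
    echo "WARNING: line count $lines is not a multiple of 4 - truncated FASTQ?"
fi
echo "$records records"
```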
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or, really, any cloud machine) –
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
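That rule of thumb is easy to turn into a back-of-the-envelope estimate. This is a sketch, not an exact formula, and the 500-million-read figure is just an example input:

```shell
# Rule of thumb from above: ~1 TB of intermediate disk per 200 million reads.
reads_millions=500                              # e.g. 5 samples x 100m reads
tb_needed=$(( (reads_millions + 199) / 200 ))   # integer division, rounded up
echo "budget roughly ${tb_needed} TB of scratch space"
```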
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine – you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/xvdb       38565344 20098736   16500940  55% /mnt
which tells you that /mnt has 40 GB of disk space.
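If you want that "Available" number as a single value in a script, you can pull it out of the 'df -k' output with awk. This sketch feeds in the sample output from above instead of running df, so it is self-contained; the column position assumes the standard 'df -k' layout:

```shell
# Parse df-style output: NR==2 skips the header line, $4 is 'Available' (KB).
avail_kb=$(printf '%s\n' \
  'Filesystem     1K-blocks     Used Available Use% Mounted on' \
  '/dev/xvdb       38565344 20098736  16500940  55% /mnt' |
  awk 'NR==2 {print $4}')
echo "available: ${avail_kb} KB (~$((avail_kb / 1024 / 1024)) GB)"
```

On a live machine you would replace the printf with 'df -k /mnt'.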
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
labibi Documentation Release 10
38 Chapter 10 Remapping your reads to your assembled transcriptome
CHAPTER 11
Miscellaneous advice
111 Sequencing depth and number of samples
Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )
In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes
112 Downloading your data
If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get
hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6
URI ftprnaseqmodelQecheJa6titanbchmsuedu
you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link
In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC
To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC
mkdir ~testdatacd ~testdata
wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata
This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do
wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz
39
labibi Documentation Release 10
Tada
113 Developing your own pipeline
Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips
1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial
2 Generate a data subset (the first few 100k reads for example)
2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such
3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works
4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)
5 Make sure the qsub script works on your same small data
6 Scale up to a big test data set
7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)
8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare
Next More resources
40 Chapter 11 Miscellaneous advice
CHAPTER 12
More resources
121 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
122 Places to share data scripts and results files
Figshare
41
labibi Documentation Release 10
42 Chapter 12 More resources
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
CHAPTER 11
Miscellaneous advice
111 Sequencing depth and number of samples
Hart et al (2013) provides a nice description and a set of tools for estimating your needed sequencing depth andnumber of samples They provide an Excel based calculator for calculating number of samples Their numbers aresurprisingly large to me )
In a proposal for an exploratory effort to discover differentially expressed genes I would suggest 3-5 biologicalreplicates with 30-50 million reads each More reads is usually cheaper than more replicates so 50-100m reads maygive you more power to resolve smaller fold changes
112 Downloading your data
If you do your sequencing at the MSU Core Facility yoursquoll get an e-mail from them when yoursquore samples are readyThe e-mail will give you an FTP site a username and a password as well as a URL You can use these to downloadyour data For example if you get
hostname titanbchmsueduusername rnaseqmodelpassword QecheJa6
URI ftprnaseqmodelQecheJa6titanbchmsuedu
you can go to ftprnaseqmodelQecheJa6titanbchmsuedu in your Web browser that is it lets you combine yourusername and password to open that link
In this case you will see a lsquotestdatarsquo directory If you click on that yoursquoll see a bunch of fastqgz files These are thefiles that you want to get onto the HPC
To download these files onto the HPC log into the HPC go to the directory on the HPC you want to put the files inand run a lsquowgetrsquo ndash for example on the HPC
mkdir ~testdatacd ~testdata
wget -r -np -nH ftprnaseqmodelQecheJa6titanbchmsuedutestdata
This will download _all_ of the files in that directory You can also do them one at a time eg to getlsquoAth_Mut_1_R1fastqgzrsquo you would do
wget ftprnaseqmodelQecheJa6titanbchmsuedutestdataAth_Mut_1_R1fastqgz
39
labibi Documentation Release 10
Tada
113 Developing your own pipeline
Even if all you plan to do is change the filenames yoursquore operating on yoursquoll need to develop your own analysispipeline Here are some tips
1 Start with someone elsersquos approach donrsquot design your own There are lots of partly done examples that you canfind on the Web including in this tutorial
2 Generate a data subset (the first few 100k reads for example)
2 Run commands interactively on an HPC dev node until you get all of the commands basically working track allof your commands in a Word document or some such
3 Once you have a set of commands that seems to work on small data write a script Run the script on the smalldata again make sure that works
4 Turn it into a qsub script (making sure yoursquore in the right directory have the modules loaded etc)
5 Make sure the qsub script works on your same small data
6 Scale up to a big test data set
7 Once thatrsquos all working SAVE THE SCRIPT SOMEWHERE Then edit it to work on all your data sets (youmay want to make subsets again as much as possible)
8 Provide your scripts and raw counts files as part of any publication or thesis perhaps via figshare
Next More resources
40 Chapter 11 Miscellaneous advice
CHAPTER 12
More resources
121 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
122 Places to share data scripts and results files
Figshare
41
labibi Documentation Release 10
42 Chapter 12 More resources
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
labibi Documentation, Release 1.0
Tada!
11.3 Developing your own pipeline
Even if all you plan to do is change the filenames you're operating on, you'll need to develop your own analysis pipeline. Here are some tips:
1. Start with someone else's approach; don't design your own. There are lots of partly done examples that you can find on the Web, including in this tutorial.
2. Generate a data subset (the first few 100k reads, for example).
3. Run commands interactively on an HPC dev node until you get all of the commands basically working; track all of your commands in a Word document or some such.
4. Once you have a set of commands that seems to work on small data, write a script. Run the script on the small data again; make sure that works.
5. Turn it into a qsub script (making sure you're in the right directory, have the modules loaded, etc.).
6. Make sure the qsub script works on your same small data.
7. Scale up to a big test data set.
8. Once that's all working, SAVE THE SCRIPT SOMEWHERE. Then edit it to work on all your data sets (you may want to make subsets again, as much as possible).
9. Provide your scripts and raw counts files as part of any publication or thesis, perhaps via figshare.
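The "write a script" step above can be sketched like this. Everything in it - filenames, the fake subset, the two toy steps - is a placeholder for your own commands, not part of the tutorial's real pipeline:

```shell
#!/bin/bash
# Sketch of a pipeline script assembled from interactively-tested commands.
# All filenames here are placeholders standing in for your real data.
set -e   # stop at the first failing step
set -x   # echo each command, so the log records exactly what ran

# a tiny fake "subset" standing in for the first ~100k reads of a real file
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > subset_1.fq.gz

# each later step consumes the previous step's output
gunzip -c subset_1.fq.gz > reads.fq
grep -c '^@' reads.fq > read_count.txt
```

Once a script like this runs cleanly on the subset, the same file (with your cluster's qsub headers added) can be pointed at the full data.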
Next: More resources
CHAPTER 12
More resources
12.1 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R, etc.
Software Carpentry - more scripting, Python, etc.
12.2 Places to share data, scripts, and results files
Figshare
CHAPTER 13
Miscellaneous questions
1. When should I use de novo assembly, and when should I use reference-guided (ab initio) assembly?
This is always a judgement call, and you can always try both (although there aren't good methods for comparing the results).
The short version is that if you have no nearby genomic sequence, you must use de novo assembly; if you have an incomplete genomic sequence, you may want to use de novo assembly; and if you have a great genomic sequence, you shouldn't use de novo assembly.
The positives of using de novo assembly are that you do not depend in any way on the reference. So if the reference genome is missing, incomplete, or incorrect, you will not have biased results from doing it.
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you will from reference-based transcriptome assembly, and the process is probably a bit more computationally intensive (and certainly more subject to problems from bad data).
2. What are "transcript families"?
Transcript families and components are computational terms for "transcripts that may share exons". The biological analogy to use is splice isoforms - but keep in mind that the computer can't necessarily tell the difference between transcripts that are "real" splice variants, noisy splicing, different allelic variants of transcripts, recent paralogs, etc., etc. - all the computer knows is that the transcripts share some amount of sequence.
So transcript families are Trinity's best guess at transcripts that come from the same locus.
3. What should we look at in FastQC results for RNAseq data?
The main thing to pay attention to is the first graph, of quality scores vs. position. If your average quality takes a big dip at a particular position, you might consider trimming at that position.
4. How do we transfer our data to Amazon (or any remote computer)?
There are two options:
If your data is on your local computer, you can use Cyberduck to transfer the data to Amazon (see Tips and Tricks for working with Remote Computers).
If the data is on a remote computer (like your sequencing center), you can probably use 'curl' or 'wget' to copy the data directly from the sequencing center to your Amazon computer. You should ask them what the full URL (with username and password) is to each of your data sets, or find your local computer expert to help out.
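As a sketch of the 'curl' route - the host, path, and login below are made up; substitute the URL and credentials your sequencing center actually gives you:

```shell
# hypothetical URL and credentials -- replace with what your core provides
curl -O -u myusername:mypassword \
    https://sequencing.example.edu/projects/myrun/sample_R1.fastq.gz
```

The -O flag saves the file under its remote name, and -u passes the username and password the core assigned you.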
5. How do we use Amazon to run full analyses?
See Tips and Tricks for working with Remote Computers, "Running full analyses".
6. Can we use XSEDE or iPlant or <insert other platform here> to run these analyses?
Yes, but you should omit all of the 'apt-get' and 'pip install' instructions - the sysadmins on those computers will need to install these programs for you.
7. How do we know if our reference transcriptome is "good enough"?
See Remapping your reads to your assembled transcriptome
8. How do I choose the set of tools to use?
Our recommendations, in order:
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
CHAPTER 14
Tips and Tricks for working with Remote Computers
14.1 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run. The 'screen' command will let you run programs and record the output, and then come back later and "reconnect".
For example, try running the beginning bit of digital normalization (Running digital normalization) inside of screen:
screen
cd /mnt/work
normalize-by-median.py -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20.ct *.pe.qc.fq.gz
The normalize-by-median command will take a while, but now that it's running in screen, you can "detach" from your remote computer and walk away for a bit. For example:
• close your terminal window;
• open up a new one and connect into your Amazon machine;
• type 'screen -r' to reconnect into your running screen.
(See amazon/using-screen for a more complete rundown on instructions.)
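Two screen conveniences worth knowing, assuming GNU screen (the session name 'assembly' here is arbitrary):

```shell
screen -S assembly     # start a new session with a memorable name
# ... start your long-running command, then press Ctrl-a d to detach ...
screen -ls             # list running sessions, to find the one you want
screen -r assembly     # reattach to that session by name
```

Naming sessions matters once you have more than one long job going; 'screen -r' alone is ambiguous when several sessions exist.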
14.2 Use CyberDuck to transfer files
To transfer remote files to your local laptop, or local laptop files to the remote system, try using CyberDuck. We'll walk through it in class.
14.3 Subsetting data
If you want to generate a small subset of a FASTQ file for testing, you can do something like this:
gunzip -c /mnt/data/SRR534005_1.fastq.gz | head -400000 | gzip > sample.fq.gz
This will take 400,000 lines (or 100,000 FASTQ records) from the beginning of the SRR534005_1.fastq.gz file and put them in the sample.fq.gz file.
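For paired-end data, the same recipe works as long as you take the same number of lines (a multiple of 4) from both files, so the read pairs stay in sync. A self-contained sketch - the tiny generated files below are stand-ins for real _1/_2 read files:

```shell
#!/bin/bash
set -e

# stand-ins for real paired files such as SRR534005_1.fastq.gz / _2.fastq.gz
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' | gzip > demo_1.fastq.gz
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' | gzip > demo_2.fastq.gz

# take the SAME number of lines (a multiple of 4) from each file
N=4
gunzip -c demo_1.fastq.gz | head -n "$N" | gzip > subset_1.fq.gz
gunzip -c demo_2.fastq.gz | head -n "$N" | gzip > subset_2.fq.gz

# both subsets now hold N/4 records apiece, still properly paired
gunzip -c subset_1.fq.gz | wc -l
```

If the two subsets end up with different line counts, downstream tools that expect interleaved or paired input will misbehave.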
14.4 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really, any cloud machine):
1. you need to get your data onto that machine;
2. you need to be prepared to let things run for a long time;
3. you need to have a large disk to store all the intermediate files. A good rule of thumb is that every 200 million reads requires about a TB of intermediate disk space.
Getting your data onto the machine can be done by using the 'curl' command to download data from (e.g.) your sequencing core. This will be core specific, and it's something we can help you with when you need the help.
To let things run for a long time, you basically need to run them in screen (see above, "Use screen").
By default, Amazon doesn't give you really big hard disks on your machine; you can use 'df' to take a look. On an m3.xlarge machine, you can ask about disk space on /mnt by using 'df' (disk free):
df -k /mnt
You should see something like this:
Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/xvdb       38565344  20098736   16500940   55%  /mnt
which tells you that /mnt has 40 GB of disk space.
To add disk space to your Amazon instance, see this set of instructions:
http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html
The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
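Once an EBS volume is attached, the remaining steps are to format it (once) and mount it. A sketch, run with root privileges - the device name /dev/xvdf is an assumption and varies by instance type:

```shell
lsblk                          # find the newly attached device (e.g. xvdf)
sudo mkfs -t ext4 /dev/xvdf    # format it (ONLY on a fresh, empty volume!)
sudo mount /dev/xvdf /mnt      # mount it at /mnt
df -h /mnt                     # confirm the new size
```

Formatting erases the volume, so only run mkfs on a newly created disk, never on one that already holds data.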
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
CHAPTER 12
More resources
121 Informational resources
UT (Austin) Sequencing Core prices - costs and yields for sequencing
ANGUS - summer NGS course - lots of resources and materials and book reference
Data Carpentry - intro to R etc
Software Carpentry - more scripting Python etc
122 Places to share data scripts and results files
Figshare
41
labibi Documentation Release 10
42 Chapter 12 More resources
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
labibi Documentation Release 10
42 Chapter 12 More resources
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that mnt has 40 GB of disk space
To add disk space to your Amazon instance see this set of instructions
httpangusreadthedocsorgen2014amazonsetting-up-an-ebs-volumehtml
The simplest advice is to make mnt a 1 TB disk which should hold a half dozen mRNAseq data sets and all theintermediate data
46 Chapter 14 Tips and Tricks for working with Remote Computers
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at httpsgithubcomngs-docs2015-may-nonmodel
47
- Welcome
-
- 1 Learning goals
- 2 Safe space and code of conduct
- 3 Instructor introductions
- 4 Amazon and cloud computing - why
- 5 Sticky notes and how they work + Minute Cards
- 6 Warning
-
- Non-model organisms and RNAseq
-
- The overall process
-
- Getting started with Amazon EC2
-
- More details
-
- Short read quality and trimming
-
- Prepping the computer
- Data source
- 1 Copying in some data to work with
- 1 Copying data into a working location
- 2 FastQC
- 3 Trimmomatic
- 4 FastQC again
- 5 Trim the rest of the sequences
- 6 Interleave the sequences
-
- Running digital normalization
- Running the actual assembly
- BLASTing your assembled data
- Annotation of denovo transcriptome
-
- Identify the GeneTranscript relationships
- Generate the longest-ORF peptide candidates from the Trinity Assembly
- Capturing BLAST Homologies
- Characterization of functional annotation features
- Integration of all annotations into one database
- Output an Annotation Report
-
- Quantification and Differential Expression
-
- Download Express
- Align Reads with Bowtie
- Quantify Expression using eXpress
- Differential Expression
-
- Remapping your reads to your assembled transcriptome
- Miscellaneous advice
-
- Sequencing depth and number of samples
- Downloading your data
- Developing your own pipeline
-
- More resources
-
- Informational resources
- Places to share data scripts and results files
-
- Miscellaneous questions
- Tips and Tricks for working with Remote Computers
-
- Use screen to run things that take a long time
- Use CyberDuck to transfer files
- Subsetting data
- Running full analyses on Amazon Web Services
-
- Technical information
-
CHAPTER 13
Miscellaneous questions
1 When should I use de novo assembly and when should I use reference-guided (ab initio) assembly
This is always a judgement call and you can always try both (although there arenrsquot good methods for comparingthe results)
The short version is that if you have no nearby genomic sequence you must use de novo assembly if you havean incomplete genomic sequence you may want to use de novo assembly and if you have a great genomicsequence you shouldnrsquot use de novo assembly
The positives of using de novo assembly are that you do not depend in any way on the reference So if thereference genome is missing incomplete or incorrect you will not have biased results from doing it
The negatives are that you will get many more isoforms from de novo transcriptome assembly than you willfrom reference-based transcriptome assembly and the process is probably a bit more computationally intensive(and certainly more subject to problems from bad data)
2 What are ldquotranscript familiesrdquo
Transcript families and components are computational terms for ldquotranscripts that may share exonsrdquo The biolog-ical analogy to use is splice isoforms - but keep in mind that the computer canrsquot necessarily tell the differencebetween transcripts that are ldquorealrdquo splice variants noisy splicing different allelic variants of transcripts recentparalogs etc etc - all the computer knows is that the transcripts share some amount of sequence
So transcript families are Trinityrsquos best guess at transcripts that come from the same locus
3 What should we look at in FastQC results for RNAseq data
The main thing to pay attention to is the first graph of quality scores vs position If your average quality takesa big dip at a particular position you might consider trimming at that position
4 How do we transfer our data to Amazon (or any remote computer)
There are two options ndash
If your data is on your local computer you can use Cyberduck to transfer the data to Amazon (see Tips andTricks for working with Remote Computers)
If the data is on a remote computer (like your sequencing center) you can probably use lsquocurlrsquo or lsquowgetrsquo to copythe data directly from the sequencing center to your Amazon computer You should ask them what the full URL(with username and password) is to each of your data sets or find your local computer expert to help out
5 How do we use Amazon to run full analyses
See Tips and Tricks for working with Remote Computers ldquoRunning full analysesrdquo
6 Can we use XSEDE or iPlant or ltinsert other platform heregt to run these analyses
43
labibi Documentation Release 10
Yes but you should omit all of the lsquoapt-getrsquo and lsquopip installrsquo instructions - the sysadmins on those computerswill need to install these programs for you
7 How do we know if our reference transcriptome is ldquogood enoughrdquo
See Remapping your reads to your assembled transcriptome
8 How do I choose the set of tools to use
Our recommendations in order
(a) Find a tool that a nearby lab is using and start there
(b) Look at tools and workflows that are used in published papers by groups working in your area
(c) Look for good tutorials online
44 Chapter 13 Miscellaneous questions
CHAPTER 14
Tips and Tricks for working with Remote Computers
141 Use screen to run things that take a long time
Often you want to run things that will take days or weeks to run The lsquoscreenrsquo command will let you run programs andrecord the output and then come back later and ldquoreconnectrdquo
For example try running the beginning bit of digital normalization (Running digital normalization) inside of screen
screencd mntworknormalize-by-medianpy -k 20 -p -C 20 -N 4 -x 2e9 -s normC20k20ct peqcfqgz
The normalize-by-median command will take a while but now that itrsquos running in screen you can ldquodetachrdquo from yourremote computer and walk away for a bit For example
bull close your terminal window
bull open up a new one and connect into your Amazon machine
bull type lsquoscreen -rrsquo to reconnect into your running screen
(See amazonusing-screen for a more complete rundown on instructions)
142 Use CyberDuck to transfer files
To transfer remote files to your local laptop or local laptop files to the remote system try using CyberDuck Wersquollwalk through it in class
143 Subsetting data
If you want to generate a small subset of a FASTQ file for testing you can do something like this
gunzip -c mntdataSRR534005_1fastqgz | head -400000 | gzip gt samplefqgz
This will take 400000 lines (or 100000 FASTQ records) from the beginning of the SRR534005_1fastqgz fileand put them in the samplefqgz file
45
labibi Documentation Release 10
144 Running full analyses on Amazon Web Services
You need to do three things to run a full analysis on AWS (or really any cloud machine) ndash
1 you need to get your data onto that machine
2 you need to be prepared to let things run for a long time
3 you need to have a large disk to store all the intermediate files A good rule of thumb is that every 200 millionreads requires about a TB of intermediate disk space
Getting your data onto the machine can be done by using the lsquocurlrsquo command to download data from (eg) yoursequencing core This will be core specific and itrsquos something we can help you with when you need the help
To let things run for a long time you basically need to run them in screen (see above ldquoUse screenrdquo)
By default Amazon doesnrsquot give you really big hard disks on your machine ndash you can use lsquodfrsquo to take a look On anm3xlarge machine you can ask about disk space on mnt by using lsquodfrsquo (disk free)
df -k mnt
You should see something like this
Filesystem 1K-blocks Used Available Use Mounted ondevxvdb 38565344 20098736 16500940 55 mnt
which tells you that /mnt has about 40 GB of disk space.
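If you prefer the number in GB directly, you can let awk do the division of df's 1K-block counts. The sketch below runs on the current directory so it works anywhere; substitute /mnt on your instance:

```shell
# -P forces single-line (POSIX) output; column 4 is "Available" in 1K blocks.
df -kP . | awk 'NR==2 { printf "%.0f GB available\n", $4 / 1048576 }'
```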
To add disk space to your Amazon instance, see this set of instructions:

http://angus.readthedocs.org/en/2014/amazon/setting-up-an-ebs-volume.html

The simplest advice is to make /mnt a 1 TB disk, which should hold a half dozen mRNAseq data sets and all the intermediate data.
CHAPTER 15
Technical information
The github repository for this workshop is publicly available at https://github.com/ngs-docs/2015-may-nonmodel.