Introduction to Next Generation Sequencing Analysis: Part I
Short Read Mapping and Visualization
Phillip Richmond (@Phil_A_Richmond)
November 23rd, 2016
Workshop outline
1. Introduction
2. Preparing your workshop directory
3. Short Read Mapping “Pipeline”
4. Learn how to use BWA and Samtools
5. Analyze example dataset
6. Visualize example dataset
7. Q & A, work individually on additional samples
Welcome!
● Welcome to the UBC Advanced Research Computing (ARC) Workshop!
● As the first session in what is hopefully a useful series, we are open to comments/critiques on what works/fails
● Info about ARC and who we are
○ https://arc.ubc.ca/
● Info about WestGrid for when you fall in love and want to pursue further analysis on the High Performance Compute (HPC) systems
○ https://www.westgrid.ca/
Learning Goals
● Learn to interact with WestGrid compute environment and queuing system
● Explore command-line usage of popular Bioinformatics Tools used in an abundance of applications
● Learn about file formats (Fastq, SAM, BAM, Fasta)
● Visualize mapped reads using Integrative Genomics Viewer (IGV)
● Gain confidence in the ability to analyze your own data!
Interactive Experience
We hope this is an interactive experience for all of you
Questions/Problems can be posted to the group chat in Vidyo, or to this Google Doc:
https://docs.google.com/document/d/15nwI7Bl2Y1Miyk_yE4-WvduAkweYZ1-LEzjRw7p__JM/edit
We have 4 TAs to assist in answering questions and solving problems. At the end of the session I can address unresolved questions.
Computing via servers
● User interacts with their own desktop
● Through a terminal, they can communicate with the head node
● The head node communicates with the execution nodes through the job scheduler
[Figure: terminal, ssh connection, head node (orcinus.westgrid.ca), job scheduler and job scripts]
Short Read Sequencing
http://bitesizebio.com/13546/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
● Several genomics applications for short-read DNA sequencing and alignment
● Variant/Mutation calling
● Protein:DNA/RNA interactions
○ ChIP-seq, Clip-Seq
● 3-D Chromatin Organization
○ Capture Hi-C
● Regulatory Sequence Analysis
○ MPRA, STARR-seq, CRE-seq, CREST-Seq
● Transcriptional analysis
○ GRO-seq, RNA-seq, Ribo-seq, CAGE-seq, 3’-Seq
Let’s get started! Login to Orcinus
You should have already attempted this by now, but as a reminder:
1. Open up a terminal (PC: MobaXterm, PuTTY | Mac/Linux: Terminal)
2. Login to Orcinus
$ ssh <username>@orcinus.westgrid.ca
NOTE: Whenever you see something written inside <>, replace it with what applies to you. Whenever a line starts with “$”, it is a command for you to type (without the “$”).
Example:
$ ssh [email protected]
Orcinus Filesystem Organization
/
    global/
        scratch/
            ARC_Training/
                GENOME/  PROCESS/  Quiz/  RAW_DATA/  SCRIPTS/
        software/
    home/
        user01/  user02/  user.../  richmonp/
    tmp/
    (ignore the rest)
First logging in: Your home directory
Command (print working directory):
$ pwd
Let’s explore: /global/scratch/ARC_Training/
Command:
$ cd /global/scratch/ARC_Training/
Make yourself a “Workshop” directory inside of PROCESS/, title it: <LASTNAME>/
Example:
$ mkdir /global/scratch/ARC_Training/PROCESS/RICHMOND/
Let’s copy some files into your Workshop directory
Example: $ cp /global/scratch/ARC_Training/RAW_DATA/NA20845* /global/scratch/ARC_Training/PROCESS/RICHMOND/
RICHMOND/ now contains: NA20845.chr19.subregion_R1.fastq, NA20845.chr19.subregion_R2.fastq
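The two steps above (make a directory under PROCESS/, then copy the raw data into it) could be wrapped in one small helper. This is only a sketch: the function name `setup_workshop_dir` and its three arguments are hypothetical and not part of the workshop materials, and the training root is passed in rather than hard-coded.

```shell
# Hypothetical helper: create PROCESS/<LASTNAME>/ under a training root
# and copy every RAW_DATA file whose name starts with a sample prefix.
setup_workshop_dir() {
    root=$1      # e.g. /global/scratch/ARC_Training
    lastname=$2  # e.g. RICHMOND
    sample=$3    # e.g. NA20845

    workdir="$root/PROCESS/$lastname"
    # -p: do not complain if the directory already exists
    mkdir -p "$workdir" || return 1
    # The unquoted * lets the shell expand to all matching file names
    cp "$root/RAW_DATA/$sample"* "$workdir"/ || return 1
    # Print the directory that was set up
    echo "$workdir"
}
```

On Orcinus this would be called as `setup_workshop_dir /global/scratch/ARC_Training RICHMOND NA20845`, with your own last name in place of RICHMOND.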
What are these Files?
● These files come from the 1000 Genomes Project, and represent paired-end sequencing raw-data files
● Lots of data is available in this format through the Sequence Read Archive (SRA), formerly called the Short Read Archive
○ https://www.ncbi.nlm.nih.gov/sra
● Fastq (AKA: FastQ, fq) files contain raw sequence “reads”. For paired-end reads, the files are ordered so that the Nth read in the _R1 file has its mate as the Nth read in the _R2 file
● You can look at the contents of the file using the head command:
$ head <filename>
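Because paired files are expected to line up read for read, a quick sanity check is to confirm that both files hold the same number of reads. The sketch below assumes the common layout of exactly four lines per read (no wrapped sequence lines); the function names `count_reads` and `check_pairs` are made up for illustration.

```shell
# Count reads in a FASTQ file, assuming each read occupies exactly 4 lines.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Compare the read counts of an _R1 file and an _R2 file.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads"
        return 1
    fi
}
```

For the workshop files this would be run as `check_pairs NA20845.chr19.subregion_R1.fastq NA20845.chr19.subregion_R2.fastq`. Equal counts do not prove the reads are in matching order, but unequal counts are a clear sign that something went wrong during copying.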
FastQ file format
● File extension .fastq or .fq
Example:
@Read_identifier_and_flowcell_info
ACGTCCGGTTNNN…
+
B$!?NP\\\[%&C…
The four lines of each record are: read name, sequence, “+”, quality score
https://en.wikipedia.org/wiki/FASTQ_format
[Figure: quality score plotted against probability of error]
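The relationship between a quality character and its error probability can be worked out by hand. The sketch below assumes the Phred+33 encoding (character code minus 33 gives the score Q, and the error probability is 10^(-Q/10)); the slides do not say which encoding these files use, so treat the offset as an assumption. The function names are hypothetical.

```shell
# Convert one FASTQ quality character to its Phred score,
# ASSUMING the Phred+33 encoding (not stated in the slides).
phred_score() {
    # printf with a leading quote prints the character's numeric code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Convert a Phred score Q to an error probability: P = 10^(-Q/10)
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_score I    # character code 73, so the score is 40
error_prob 40    # prints 0.0001
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.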
Let’s also explore some human genome files
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa
genome.fa.fai
Example: $ more genome.fa
Pipeline Overview
Raw reads: Sample.Reads1.fastq, Sample.Reads2.fastq
Genome index: genome.fa* (genome.fa.ann, genome.fa.amb, genome.fa.pac, genome.fa.bwt, genome.fa.sa)
Read mapping: BWA mem -> Sample.sam
File format conversion: samtools view -> Sample.bam, samtools sort -> Sample.sorted.bam, samtools index -> Sample.sorted.bam.bai
Visualization: IGV
First: Read mapping
Learning the bwa command
First we need to load the module that has the bwa command in it
$ module load bio-tools
Next we will call the bwa mem command to see how it’s used
$ bwa mem
Let’s break down this usage statement:
$ bwa mem [options] <idxbase> <in1.fq> [in2.fq]
[ ] is an optional argument
<> is required and is asking you to replace what’s inside with the appropriate value
Example:
$ bwa mem genome.fa Sample.Reads1.fastq Sample.Reads2.fastq > Sample.sam
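The <idxbase> argument only works if the index files listed in the pipeline overview sit next to genome.fa. A small pre-flight check along these lines can catch a missing file before the mapping starts. It is a sketch only: `check_bwa_index` is a hypothetical helper, not a bwa command.

```shell
# Hypothetical pre-flight check: confirm that the five BWA index files
# that accompany a reference FASTA are present before running bwa mem.
check_bwa_index() {
    ref=$1
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$ref.$ext" ]; then
            # Report each missing file on standard error
            echo "missing: $ref.$ext" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise
    return $missing
}
```

It could be combined with the mapping command as `check_bwa_index genome.fa && bwa mem genome.fa Sample.Reads1.fastq Sample.Reads2.fastq > Sample.sam`, so that bwa only runs when the index is complete.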
Next: File Format Conversion
Learning the samtools commands
We will use 3 samtools operations: view, sort, and index (in that order)
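$ samtools view -b <in.sam> -o <out.bam>
$ samtools view -b Sample1.sam -o Sample1.bam
$ samtools sort <in.bam> <out.sorted>
$ samtools sort Sample1.bam Sample1.sorted
$ samtools index <in.sorted.bam>
$ samtools index Sample1.sorted.bam
NOTE: In this form of samtools sort, the second argument is an output prefix and samtools appends .bam to it. This is the syntax of samtools releases before 1.3. Newer releases expect: $ samtools sort -o <out.sorted.bam> <in.bam>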
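To see how the output of each step feeds the next, it can help to print the three commands for a given SAM file before running anything. The sketch below is a dry run only: `print_conversion_commands` is a hypothetical helper, and the sort line mirrors the output-prefix form used on the slide.

```shell
# Print (but do not run) the three conversion commands for one SAM file.
# Output names are derived from the input by swapping the .sam suffix.
print_conversion_commands() {
    sam=$1
    base=${sam%.sam}    # strip the trailing .sam
    echo "samtools view -b $sam -o $base.bam"
    echo "samtools sort $base.bam $base.sorted"
    echo "samtools index $base.sorted.bam"
}

print_conversion_commands Sample1.sam
```

The printed lines match the examples above, which makes the naming chain explicit: Sample1.sam, then Sample1.bam, then Sample1.sorted.bam, then its index.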
Let’s chat briefly about the queue
Interacting with the queue is done with a few commands:
Submit a queue script:
$ qsub <file.pbs>
Check the status of the queue
$ showq
$ qstat
Check the status of your jobs in the queue
$ showq -u <username>
$ showq -u richmonp
The .pbs queue script
● The best resource for understanding queue scripts is:
○ https://www.westgrid.ca/support/running_jobs
● Lucky for you, I’ve made a script with the bwa mem and samtools commands in it.
● Copy this script into your Workshop directory:
/global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs
Example:
$ cp /global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs /global/scratch/ARC_Training/PROCESS/RICHMOND/
Open MapAndConvert.pbs in emacs
#!/bin/bash
#PBS -S /bin/bash
## I want 4 processors
#PBS -l procs=4
## How much RAM does each processor need?
#PBS -l pmem=2000mb
## The maximum walltime that will be used for my job
#PBS -l walltime=00:15:00
## I want email sent when the job begins, ends and aborts (bea)
#PBS -m bea
## Where I want the email to be sent
#PBS -M [email protected]
Make sure you edit this to be your own email address (doesn’t have to be gmail)
$ emacs <filename>
$ emacs /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs
Edit MapAndConvert.pbs, change RICHMOND
## Load the module containing bwa and samtools
module load bio-tools
## Map with BWA
bwa mem -t 4 /global/scratch/ARC_Training/PROCESS/RICHMOND/genome.fa /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R1.fastq /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R2.fastq > /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam
## Convert sam to bam using samtools view
samtools view -b /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam -o /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam
## Sort the bam file using samtools sort
samtools sort /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted
## Index the sorted bam
samtools index /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted.bam
Make sure you change all instances of RICHMOND to your own last name
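Editing every path by hand in emacs works, but it is easy to miss one. As an alternative sketch, sed can replace every occurrence at once. The helper name `personalize_script` and the output file name are hypothetical, and the approach assumes your last name contains no characters that are special to sed (such as “/”). The email address on the #PBS -M line still has to be edited separately.

```shell
# Hypothetical helper: write a copy of a job script in which every
# occurrence of RICHMOND is replaced by another last name.
personalize_script() {
    template=$1   # e.g. MapAndConvert.pbs
    lastname=$2   # e.g. SMITH
    output=$3     # e.g. MapAndConvert.SMITH.pbs
    # The g flag replaces every occurrence on each line, not just the first
    sed "s/RICHMOND/$lastname/g" "$template" > "$output"
    # Report how many lines still mention RICHMOND (should be 0)
    echo "lines still containing RICHMOND: $(grep -c RICHMOND "$output")"
}
```

For example, `personalize_script MapAndConvert.pbs SMITH MapAndConvert.SMITH.pbs` leaves the original script untouched and writes the edited copy next to it.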
Now we can run our job in the queue
Submit job using qsub
$ qsub <file.pbs>
$ qsub /global/scratch/ARC_Training/PROCESS/RICHMOND/
Check job status using showq or qstat
$ qstat -u <username>
$ qstat -u richmonp
$ showq -u <username>
$ showq -u richmonp
The output SAM file
@SQ - Sequence (contig/chromosome) from reference file
@PG - Program information about mapping
@RG - Read group information (we won’t have any here)
Tab delimited, each line is 1 read. Pairs will be next to each other in the file (e.g. Line 1: Read1, Line 2: Read2)
https://samtools.github.io/hts-specs/SAMv1.pdf
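Since a SAM file is tab-delimited text, standard tools such as awk can pull out individual columns. The sketch below prints the first four columns of each alignment line (read name, flag, reference name, position) and skips the header lines that start with “@”. The function name `sam_summary` is hypothetical, and the demo record is made up purely for illustration; it is not taken from the workshop data.

```shell
# Print read name, flag, reference name and position for each alignment,
# skipping header lines (which start with @).
sam_summary() {
    awk -F '\t' '!/^@/ { print $1, $2, $3, $4 }' "$1"
}

# Demo on a tiny, made-up SAM file (one header line, one alignment)
demo=$(mktemp)
printf '@SQ\tSN:chr19\tLN:1000\nread1\t99\tchr19\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n' > "$demo"
sam_summary "$demo"    # prints: read1 99 chr19 100
rm -f "$demo"
```

On the workshop output this would be run as `sam_summary NA20845.chr19.subregion.sam | head`.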
A BAM file is a binary version of the SAM file
We cannot look at these binary files the same way as we look at text files (samtools view <file.bam> prints one as text)
Downstream applications will almost always ask for a .bam file
Sorting by coordinate is necessary for indexing and for most downstream applications
The index (.bai) will be required for IGV
Data visualization
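Use FileZilla to transfer the sorted BAM file and its index (.sorted.bam and .sorted.bam.bai) onto your own computer
Open up IGV, and load the sorted BAM file we just created
In the search box, type: chr19:1,201,956-1,242,206
[Screenshot: IGV window with the search box and zoom tool labeled]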
Congratulations!
In the remaining time, please try to repeat what we just did with these other raw data files in /global/scratch/ARC_Training/RAW_DATA/
Also, you can try to use different options with bwa mem:
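-k 25 (minimum seed length; default 19)
-B 10 (mismatch penalty; default 4)
-O 12,12 (gap open penalties for deletions and insertions; default 6,6)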
Visualize multiple samples at the same time in IGV
Thanks for participating!
We will contact you in a few days with the following:
1. A form for feedback on the course
2. The date of an “office hours” session in the next two weeks regarding this material
3. Information about how to get a full account on WestGrid for future analysis projects for those with temporary logins
Hope to see you all at the next workshop!