
Introduction to Next Generation Sequencing Analysis: Part I

Short Read Mapping and Visualization

Phillip Richmond (@Phil_A_Richmond)

November 23rd, 2016

Workshop outline

1. Introduction
2. Preparing your workshop directory
3. Short Read Mapping “Pipeline”
4. Learn how to use BWA and Samtools
5. Analyze example dataset
6. Visualize example dataset
7. Q & A, work individually on additional samples

Welcome!

● Welcome to the UBC Advanced Research Computing (ARC) Workshop!
● As the first session in what is hopefully a useful series, we are open to comments/critiques on what works/fails
● Info about ARC and who we are
  ○ https://arc.ubc.ca/
● Info about WestGrid for when you fall in love and want to pursue further analysis on the High Performance Compute (HPC) systems
  ○ https://www.westgrid.ca/


Learning Goals

● Learn to interact with WestGrid compute environment and queuing system
● Explore command-line usage of popular Bioinformatics Tools used in an abundance of applications
● Learn about file formats (Fastq, SAM, BAM, Fasta)
● Visualize mapped reads using Integrative Genomics Viewer (IGV)
● Gain confidence in the ability to analyze your own data!

Interactive Experience

We hope this is an interactive experience for all of you

Questions/Problems can be posted to the group-chat in vidyo, or to this google doc:

https://docs.google.com/document/d/15nwI7Bl2Y1Miyk_yE4-WvduAkweYZ1-LEzjRw7p__JM/edit

We have 4 TAs to assist in answering questions and solving problems. At the end of the session I can address unresolved questions.




Computing via servers

● User interacts with their own desktop

● Through a terminal, they can communicate with the head node

● The head node communicates with the execution nodes through the job scheduler

[Diagram: your terminal opens an ssh connection to the head node (orcinus.westgrid.ca); the job scheduler runs job scripts on the execution nodes]

Short Read Sequencing

http://bitesizebio.com/13546/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/

● Several genomics applications for short-read DNA sequencing and alignment
● Variant/Mutation calling
● Protein:DNA/RNA interactions
  ○ ChIP-seq, Clip-Seq
● 3-D Chromatin Organization
  ○ Capture Hi-C
● Regulatory Sequence Analysis
  ○ MPRA, STARR-seq, CRE-seq, CREST-Seq
● Transcriptional analysis
  ○ GRO-seq, RNA-seq, Ribo-seq, CAGE-seq, 3’-Seq

Let’s get started! Login to Orcinus

You should have already attempted this by now, but as a reminder:

1. Open up a terminal (PC: MobaXterm, Putty | Mac/Linux: Terminal)
2. Login to Orcinus

$ ssh <username>@orcinus.westgrid.ca

NOTE: Whenever I write something inside <>, replace it with the value that applies to you. Whenever a line starts with “$”, it is a command to type (do not type the “$” itself).

Example: $ ssh [email protected]

Orcinus Filesystem Organization

/
    global/
        home/
            user01/
            user02/
            user.../
            richmonp/
        scratch/
            ARC_Training/
                GENOME/
                PROCESS/
                Quiz/
                RAW_DATA/
                SCRIPTS/
        software/
    tmp/
    (ignore the rest)

First logging in: Your home directory

Command (print working directory): $ pwd

Let’s explore: /global/scratch/ARC_Training/

Command: $ cd /global/scratch/ARC_Training/

Make yourself a “Workshop” directory inside of PROCESS/, title it: <LASTNAME>/

Example: $ mkdir /global/scratch/ARC_Training/PROCESS/RICHMOND/

Let’s copy some files into your Workshop Directory

Example: $ cp /global/scratch/ARC_Training/RAW_DATA/NA20845* /global/scratch/ARC_Training/PROCESS/RICHMOND/

RICHMOND/ should now contain:
NA20845.chr19.subregion_R1.fastq
NA20845.chr19.subregion_R2.fastq

What are these Files?

● These files come from the 1000 Genomes Project, and represent paired-end sequencing raw-data files

● Lots of data is available in this format through the Sequence Read Archive (SRA)
  ○ https://www.ncbi.nlm.nih.gov/sra

● Fastq (AKA: FastQ, fq) files contain raw sequence “reads”. For paired-end data, the two files are ordered identically, so the Nth read in the _R1 file is the mate of the Nth read in the _R2 file

● You can look at the contents of the file using the head command:

$ head <filename>
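Because the _R1 and _R2 files must list reads in the same order, a quick sanity check is to confirm that both files hold the same number of reads. The following is a minimal sketch, not part of the workshop scripts: the function names count_fastq_reads and check_pair_counts are made up for illustration, and it assumes the plain four-lines-per-read FASTQ layout.

```shell
# Count the reads in a FASTQ file.
# Assumption: every read occupies exactly 4 lines (no wrapped sequences).
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Print both read counts and succeed (exit status 0) only if they match.
check_pair_counts() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    echo "R1: $n1 reads, R2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For the workshop files this could be called as: check_pair_counts NA20845.chr19.subregion_R1.fastq NA20845.chr19.subregion_R2.fastq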

FastQ file format

● File extension .fastq or .fq

Example:

@Read_identifier_and_flowcell_info    (read name)
ACGTCCGGTTNNN…    (sequence)
+
B$!?NP\\\[%&C…    (quality score)

https://en.wikipedia.org/wiki/FASTQ_format

[Figure: quality score versus probability of error]
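The relationship between a quality score Q and the probability of error P is P = 10^(-Q/10), and each quality character encodes Q as an ASCII code. The sketch below illustrates the conversion; it assumes the Phred+33 encoding (ASCII code minus 33), which is an assumption about these particular files, and the function names are made up for illustration.

```shell
# Convert one FASTQ quality character to its Phred score.
# Assumption: Phred+33 encoding, i.e. Q = ASCII code - 33.
phred_score() {
    local code
    code=$(printf '%d' "'$1")
    echo $(( code - 33 ))
}

# Convert a Phred score Q to an error probability: P = 10^(-Q/10).
error_probability() {
    LC_ALL=C awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

For example, error_probability 20 prints 0.0100: a score of 20 means a 1 in 100 chance that the base call is wrong.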

Let’s also explore some human genome files

Files shown under RICHMOND/:
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa
genome.fa.fai

Example: $ more genome.fa
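bwa mem needs the five index files (.amb, .ann, .bwt, .pac, .sa) to sit next to genome.fa; the .fai file is a separate samtools index of the FASTA file. A small check like the one below can confirm that nothing is missing before mapping. This is a sketch only, and check_bwa_index is a hypothetical helper name, not a bwa command. If files are missing, they can normally be rebuilt with bwa index genome.fa.

```shell
# Report which BWA index files are missing for a reference FASTA.
# Returns 0 if all five are present, 1 otherwise.
check_bwa_index() {
    local ref="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$ref.$ext" ]; then
            echo "missing: $ref.$ext"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_bwa_index /global/scratch/ARC_Training/PROCESS/<LASTNAME>/genome.fa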

Pipeline Overview

Raw reads: Sample.Reads1.fastq, Sample.Reads2.fastq
Genome index: genome.fa* (genome.fa.ann, genome.fa.amb, genome.fa.pac, genome.fa.bwt, genome.fa.sa)

1. Read mapping: bwa mem → Sample.sam
2. File format conversion:
   samtools view → Sample.bam
   samtools sort → Sample.sorted.bam
   samtools index → Sample.sorted.bam.bai
3. Visualization: IGV

First: Read mapping


Learning the bwa command

First we need to load the module that has the bwa command in it
$ module load bio-tools

Next we will call the bwa mem command to see how it’s used
$ bwa mem

Let’s break down this usage statement:
$ bwa mem [options] <idxbase> <in1.fq> [in2.fq]

[ ] is an optional argument
<> is required and is asking you to replace what’s inside with the appropriate value

Example:
$ bwa mem genome.fa Sample.Reads1.fastq Sample.Reads2.fastq > Sample.sam

Next: File Format Conversion


Learning the samtools commands

We will use 3 samtools operations: view, sort, and index (in that order)

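$ samtools view -b <in.sam> -o <out.bam>
$ samtools view -b Sample1.sam -o Sample1.bam

$ samtools sort <in.bam> <out.sorted>
$ samtools sort Sample1.bam Sample1.sorted

The second argument to sort is an output prefix; samtools adds the .bam extension. This is the syntax of older samtools releases. Samtools 1.3 and later expect: samtools sort -o <out.sorted.bam> <in.bam>

$ samtools index <in.sorted.bam>
$ samtools index Sample1.sorted.bam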
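To see how the file names chain from one step to the next, the sketch below prints (without running) the four commands for one paired-end sample. It is an illustration only: print_pipeline is a made-up helper, it assumes the <sample>_R1.fastq / <sample>_R2.fastq naming used by the workshop data, and it mirrors the sort syntax shown on this slide.

```shell
# Print (not run) the mapping and conversion commands for one sample.
# Assumption: reads are named <sample>_R1.fastq and <sample>_R2.fastq.
print_pipeline() {
    local ref="$1" r1="$2"
    local sample="${r1%_R1.fastq}"
    echo "bwa mem $ref $r1 ${sample}_R2.fastq > $sample.sam"
    echo "samtools view -b $sample.sam -o $sample.bam"
    echo "samtools sort $sample.bam $sample.sorted"
    echo "samtools index $sample.sorted.bam"
}
```

Usage: print_pipeline genome.fa NA20845.chr19.subregion_R1.fastq. Each output file of one line is the input file of the next.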

Let’s chat briefly about the queue

Interacting with the queue is done with a few commands:

Submit a queue script:

$ qsub <file.pbs>

Check the status of the queue

$ showq

$ qstat

Check the status of your jobs in the queue

$ showq -u <username>

$ showq -u richmonp


The .pbs queue script

● The best resource for understanding queue scripts is:
  ○ https://www.westgrid.ca/support/running_jobs

● Lucky for you, I’ve made a script with the bwa mem and samtools commands in it.

● Copy this script into your Workshop directory:

/global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs

Example: $ cp /global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs /global/scratch/ARC_Training/PROCESS/RICHMOND/


Open MapAndConvert.pbs in emacs

#!/bin/bash
#PBS -S /bin/bash

## I want 4 processors
#PBS -l procs=4

## How much RAM does each processor need?
#PBS -l pmem=2000mb

## The maximum walltime that will be used for my job
#PBS -l walltime=00:15:00

## I want email sent when the job begins, ends and aborts (bea)
#PBS -m bea

## Where I want the email to be sent
#PBS -M [email protected]

Make sure you edit this to be your own email address (doesn’t have to be gmail)

$ emacs <filename>
$ emacs /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs

Edit MapAndConvert.pbs, change RICHMOND

## Load the module containing bwa and samtools
module load bio-tools

## Map with BWA
bwa mem -t 4 /global/scratch/ARC_Training/PROCESS/RICHMOND/genome.fa /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R1.fastq /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R2.fastq > /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam

## Convert sam to bam using samtools view
samtools view -b /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam -o /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam

## Sort the bam file using samtools sort
samtools sort /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted

## Index the sorted bam
samtools index /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted.bam

Make sure you change all instances of RICHMOND to your own last name
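Editing every path by hand in emacs works, but the same change can be made in one step with sed. The sketch below is one possible way to do it, not part of the workshop material: personalize_script is a hypothetical helper, and it assumes the last name contains only letters (characters such as / or & would need escaping).

```shell
# Write a copy of a job script with every RICHMOND replaced by a last name.
# The original file is left untouched.
# Assumption: the last name contains only plain letters.
personalize_script() {
    local lastname="$1" infile="$2" outfile="$3"
    sed "s/RICHMOND/$lastname/g" "$infile" > "$outfile"
}
```

Usage: personalize_script SMITH MapAndConvert.pbs MapAndConvert.SMITH.pbs, then check the result with grep RICHMOND MapAndConvert.SMITH.pbs, which should print nothing.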

Now we can run our job in the queue

Submit job using qsub

$ qsub <file.pbs>
$ qsub /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs

Check job status using showq or qstat

$ qstat -u <username>
$ qstat -u richmonp

$ showq -u <username>
$ showq -u richmonp

The output SAM file

Header lines:
@SQ - Sequence (contig/chromosome) from reference file
@PG - Program information about mapping
@RG - Read group information (we won’t have any here)

Alignment lines: tab delimited, each line is 1 read. Pairs will be next to each other in the file (e.g. Line 1: Read1, Line 2: Read2).

https://samtools.github.io/hts-specs/SAMv1.pdf
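The second column of each alignment line is the FLAG, a number whose bits record facts such as whether the read is paired or unmapped. The sketch below decodes the lower eight bits as defined in the SAM specification linked above; describe_sam_flag and the label names it prints are made up for illustration.

```shell
# Print one label per bit that is set in a SAM FLAG value.
# Only the lower eight bits from the SAM specification are covered here.
describe_sam_flag() {
    local flag="$1" bit name
    while read -r bit name; do
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done <<'EOF'
1 paired
2 proper_pair
4 unmapped
8 mate_unmapped
16 reverse_strand
32 mate_reverse_strand
64 first_in_pair
128 second_in_pair
EOF
}
```

Usage: describe_sam_flag 99 lists paired, proper_pair, mate_reverse_strand and first_in_pair, a common value for the first read of a properly mapped pair.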

A BAM file is a compressed, binary version of the SAM file

We cannot look at these binary files the same way as we look at text files

Downstream applications will almost always ask for a .bam file

Sorting is necessary for downstream applications

Index will be required for IGV

Data visualization


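Use FileZilla to transfer the sorted BAM file and its index (the .sorted.bam and .sorted.bam.bai files) onto your own computer

Open up IGV, and load the sorted BAM file we just created

In the search box, type: chr19:1,201,956-1,242,206

[Screenshot: IGV window with the search box and zoom tool labelled]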

Congratulations!

In the remaining time, please try to repeat what we just did with these other raw data files in /global/scratch/ARC_Training/RAW_DATA/

Also, you can try to use different options with bwa mem:

-k 25 (minimum seed length)
-B 10 (mismatch penalty)
-O 12,12 (gap open penalties for deletions and insertions)

Visualize multiple samples at the same time in IGV

Thanks for participating!

We will contact you in a few days with the following:

1. A form for feedback on the course
2. The date of an “office hours” session in the next two weeks regarding this material
3. Information about how to get a full account on WestGrid for future analysis projects for those with temporary logins

Hope to see you all at the next workshop!
