Introduction to Next Generation Sequencing Analysis: Part I Short Read Mapping and Visualization Phillip Richmond @Phil_A_Richmond November 23rd, 2016

Page 1

Introduction to Next Generation Sequencing Analysis: Part I

Short Read Mapping and Visualization

Phillip Richmond
@Phil_A_Richmond

November 23rd, 2016

Page 2

Workshop outline

1. Introduction
2. Preparing your workshop directory
3. Short Read Mapping “Pipeline”
4. Learn how to use BWA and Samtools
5. Analyze example dataset
6. Visualize example dataset
7. Q & A, work individually on additional samples

Page 3

Welcome!

● Welcome to the UBC Advanced Research Computing (ARC) Workshop!
● As the first session in what is hopefully a useful series, we are open to comments/critiques on what works/fails
● Info about ARC and who we are
  ○ https://arc.ubc.ca/
● Info about WestGrid for when you fall in love and want to pursue further analysis on the High Performance Compute (HPC) systems
  ○ https://www.westgrid.ca/

Page 4

Learning Goals

● Learn to interact with WestGrid compute environment and queuing system
● Explore command-line usage of popular Bioinformatics Tools used in an abundance of applications
● Learn about file formats (Fastq, SAM, BAM, Fasta)
● Visualize mapped reads using Integrative Genomics Viewer (IGV)
● Gain confidence in the ability to analyze your own data!

Page 5

Interactive Experience

We hope this is an interactive experience for all of you.

Questions/problems can be posted to the group chat in Vidyo, or to this Google Doc:

https://docs.google.com/document/d/15nwI7Bl2Y1Miyk_yE4-WvduAkweYZ1-LEzjRw7p__JM/edit

We have 4 TAs to assist in answering questions and solving problems. At the end of the session I can address unresolved questions.

Page 6

Computing via servers

● User interacts with their own desktop
● Through a terminal, they can communicate with the head node
● The head node communicates with the execution nodes through the job scheduler

[Diagram labels: terminal; ssh connection; head node; orcinus.westgrid.ca; job scheduler, job scripts]

Page 7

Short Read Sequencing

http://bitesizebio.com/13546/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/

● Several genomics applications for short-read DNA sequencing and alignment
● Variant/Mutation calling
● Protein:DNA/RNA interactions
  ○ ChIP-seq, Clip-Seq
● 3-D Chromatin Organization
  ○ Capture Hi-C
● Regulatory Sequence Analysis
  ○ MPRA, STARR-seq, CRE-seq, CREST-Seq
● Transcriptional analysis
  ○ GRO-seq, RNA-seq, Ribo-seq, CAGE-seq, 3’-Seq

Page 8

Let’s get started! Login to Orcinus

You should have already attempted this by now, but as a reminder:

1. Open up a terminal (PC: MobaXterm, PuTTY | Mac/Linux: Terminal)
2. Log in to Orcinus

$ ssh <username>@orcinus.westgrid.ca

NOTE: Whenever you see me write something inside <>, I want you to replace it with what applies to you. Also, whenever there is a “$”, I am showing you a command; the “$” is the prompt and is not typed.

Example:
$ ssh richmonp@orcinus.westgrid.ca

Page 9

Orcinus Filesystem Organization

/
  global/
    scratch/
      ARC_Training/
        GENOME/  PROCESS/  Quiz/  RAW_DATA/  SCRIPTS/
    software/
  home/
    user01/  user02/  user.../  richmonp/
  tmp/
  (ignore the rest)

Page 10

First logging in: Your home directory

[Directory tree repeated from the “Orcinus Filesystem Organization” slide]

Command (print working directory):
$ pwd

Page 11

Let’s explore: /global/scratch/ARC_Training/

[Directory tree repeated from the “Orcinus Filesystem Organization” slide]

Command:
$ cd /global/scratch/ARC_Training/

Page 12

Make yourself a “Workshop” directory inside of PROCESS/, title it: <LASTNAME>/

[Directory tree repeated from the “Orcinus Filesystem Organization” slide, with a new RICHMOND/ directory added]

Example:
$ mkdir /global/scratch/ARC_Training/PROCESS/RICHMOND/

Page 13

Let’s copy some files into your Workshop Directory

/
  global/
    scratch/
      ARC_Training/
        GENOME/
        PROCESS/
          RICHMOND/
            NA20845.chr19.subregion_R1.fastq
            NA20845.chr19.subregion_R2.fastq
        Quiz/
        RAW_DATA/
        SCRIPTS/

Example:
$ cp /global/scratch/ARC_Training/RAW_DATA/NA20845* /global/scratch/ARC_Training/PROCESS/RICHMOND/
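The two steps above (make a directory, copy the sample files) can be wrapped in one small helper. This is only a sketch and not part of the workshop materials: the function name setup_workshop_dir and its three arguments are made up for illustration, and it assumes the ARC_Training layout shown on the slides.

```shell
# Hypothetical helper: create PROCESS/<LASTNAME>/ under a training directory
# and copy one sample's raw FASTQ files into it.
# Usage: setup_workshop_dir <training_dir> <LASTNAME> <sample_prefix>
setup_workshop_dir() {
    training_dir=$1
    lastname=$2
    sample=$3
    workdir="$training_dir/PROCESS/$lastname"

    # -p: no error if the directory already exists
    mkdir -p "$workdir" || return 1

    # The unquoted * expands to every file that starts with the sample prefix
    cp "$training_dir"/RAW_DATA/"$sample"* "$workdir"/ || return 1

    # List what ended up in the workshop directory
    ls "$workdir"
}
```

With the paths from the slides, the call would be: setup_workshop_dir /global/scratch/ARC_Training RICHMOND NA20845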

Page 14

What are these Files?

● These files come from the 1000 Genomes Project, and represent paired-end sequencing raw-data files
● Lots of data is available in this format through the Sequence Read Archive (SRA, formerly the Short Read Archive)
  ○ https://www.ncbi.nlm.nih.gov/sra
● Fastq (AKA: FastQ, fq) files contain raw sequence “reads”. For paired-end data, the two files are in the same order, so each read in the _R1 file has its mate at the same position in the _R2 file
● You can look at the contents of the file using the head command:

$ head <filename>
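Because the _R1 and _R2 files are meant to line up read for read, a quick sanity check is to count the reads in each. The sketch below assumes the usual layout of 4 lines per read with no wrapped sequence lines; the function names count_reads and check_pairs are made up for this example.

```shell
# Count reads in a FASTQ file, assuming 4 lines per read
# (name, sequence, "+", quality).
count_reads() {
    # wc -l < file prints only the number of lines
    echo $(( $(wc -l < "$1") / 4 ))
}

# Check that two paired-end files hold the same number of reads.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 reads vs $n2 reads"
        return 1
    fi
}
```

For the workshop sample this would be called as: check_pairs NA20845.chr19.subregion_R1.fastq NA20845.chr19.subregion_R2.fastq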

Page 15

FastQ file format

● File extension .fastq or .fq

Each read takes four lines: read name, sequence, +, quality score.

Example:

@Read_identifier_and_flowcell_info
ACGTCCGGTTNNN…
+
B$!?NP\\\[%&C…

https://en.wikipedia.org/wiki/FASTQ_format

[Figure axes: quality score, probability of error]
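The link between a quality character, its score and the probability of error can be worked out on the command line. This sketch assumes the Phred+33 encoding (score = ASCII code minus 33) and the standard Phred relation P = 10^(-Q/10); older Illumina files used a different offset, so treat the 33 as an assumption. The function names are made up for this example.

```shell
# Convert one FASTQ quality character to its Phred score,
# assuming the Phred+33 encoding.
phred_score() {
    # printf '%d' "'X" prints the ASCII code of the character X
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Error probability for a Phred score Q: P = 10^(-Q/10)
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}
```

For example, phred_score I prints 40, and error_prob 40 prints 0.0001, which is one expected error in 10,000 base calls.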

Page 16

Let’s also explore some human genome files

[Directory tree: /global/scratch/ARC_Training/ with GENOME/, PROCESS/, Quiz/, RAW_DATA/, SCRIPTS/ and RICHMOND/; genome files shown:]

genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa
genome.fa.fai

Example:
$ more genome.fa
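Paging through a whole genome with more gets long quickly, so here is a sketch of two quicker checks. The .amb, .ann, .bwt, .pac and .sa files are the ones bwa index writes next to the FASTA file, and .fai comes from samtools faidx. The function names fasta_names and missing_bwa_index are made up for this example.

```shell
# Print the sequence names in a FASTA file
# (header lines are the ones that start with ">").
fasta_names() {
    grep '^>' "$1" | sed 's/^>//'
}

# Print the name of every BWA index file that is missing
# next to the given genome FASTA file (prints nothing if all are there).
missing_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -e "$1.$ext" ] || echo "$1.$ext"
    done
}
```

Usage would be fasta_names genome.fa and missing_bwa_index genome.fa, run from the directory that holds the genome files.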

Page 17

Pipeline Overview

Raw reads: Sample.Reads1.fastq, Sample.Reads2.fastq
Genome index: genome.fa* (genome.fa.ann, genome.fa.amb, genome.fa.pac, genome.fa.bwt, genome.fa.sa)

Read mapping:
  BWA mem -> Sample.sam

File format conversion:
  samtools view -> Sample.bam
  samtools sort -> Sample.sorted.bam
  samtools index -> Sample.sorted.bam.bai

Visualization:
  IGV

Page 18

First: Read mapping

[Pipeline diagram repeated from the “Pipeline Overview” slide; this step: read mapping with BWA mem]

Page 19

Learning the bwa command

First we need to load the module that has the bwa command in it
$ module load bio-tools

Next we will call the bwa mem command to see how it’s used
$ bwa mem

Let’s break down this usage statement:
$ bwa mem [options] <idxbase> <in1.fq> [in2.fq]

[ ] is an optional argument
<> is required and is asking you to replace what’s inside with the appropriate value

Example:
$ bwa mem genome.fa Sample.Reads1.fastq Sample.Reads2.fastq > Sample.sam
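A mistyped path is the most common way for this step to go wrong, so it can help to check the inputs before mapping. The wrapper below is a sketch only: map_pair is a made-up name, and testing for the .bwt file is just a rough stand-in for "the genome has been indexed".

```shell
# Hypothetical wrapper around the bwa mem call shown above: check that the
# inputs exist and are not empty before mapping, so a typo fails fast.
# Usage: map_pair <genome.fa> <reads1.fastq> <reads2.fastq> <out.sam>
map_pair() {
    for f in "$1" "$1.bwt" "$2" "$3"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            return 1
        fi
    done
    # bwa mem writes SAM to standard output, so redirect it into a file
    bwa mem "$1" "$2" "$3" > "$4"
}
```

With the slide's file names: map_pair genome.fa Sample.Reads1.fastq Sample.Reads2.fastq Sample.sam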

Page 20

Next: File Format Conversion

[Pipeline diagram repeated from the “Pipeline Overview” slide; this step: file format conversion with samtools view, sort and index]

Page 21

Learning the samtools commands

We will use 3 samtools operations: view, sort, and index (in that order)

$ samtools view -b <in.sam> -o <out.bam>
$ samtools view -b Sample1.sam -o Sample1.bam

$ samtools sort <in.bam> <out.sorted>
$ samtools sort Sample1.bam Sample1.sorted

$ samtools index <in.sorted.bam>
$ samtools index Sample1.sorted.bam
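One caveat if you repeat this on another machine: the sort command above uses the older form that takes an output prefix. Newer samtools releases (1.3 and later, as far as I know) expect the output file to be named with -o instead. The sketch below shows the same three steps in that newer form; sam_to_sorted_bam is a made-up name, and it has not been run against the samtools installed on Orcinus.

```shell
# Sketch for newer samtools releases, where sort takes -o <out.bam>
# instead of an output prefix.
# Usage: sam_to_sorted_bam <in.sam> <out_prefix>
sam_to_sorted_bam() {
    # 1. SAM (text) to BAM (binary)
    samtools view -b "$1" -o "$2.bam" &&
    # 2. Sort the BAM by coordinate
    samtools sort -o "$2.sorted.bam" "$2.bam" &&
    # 3. Index the sorted BAM (writes <out_prefix>.sorted.bam.bai)
    samtools index "$2.sorted.bam"
}
```

With the slide's example names: sam_to_sorted_bam Sample1.sam Sample1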

Page 22

Let’s chat briefly about the queue

Interacting with the queue is done with a few commands:

Submit a queue script:
$ qsub <file.pbs>

Check the status of the queue:
$ showq
$ qstat

Check the status of your jobs in the queue:
$ showq -u <username>
$ showq -u richmonp

[Diagram repeated from the “Computing via servers” slide]

Page 23

The .pbs queue script

● The best resource for understanding queue scripts is:
  ○ https://www.westgrid.ca/support/running_jobs
● Lucky for you, I’ve made a script with the bwa mem and samtools commands in it.
● Copy this script into your Workshop directory:

/global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs

Example:
$ cp /global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs /global/scratch/ARC_Training/PROCESS/RICHMOND/

Page 24

Open MapAndConvert.pbs in emacs

#!/bin/bash
#PBS -S /bin/bash

## I want 4 processors
#PBS -l procs=4

## How much RAM does each processor need?
#PBS -l pmem=2000mb

## The maximum walltime that will be used for my job
#PBS -l walltime=00:15:00

## I want email sent when the job begins, ends and aborts (bea)
#PBS -m bea

## Where I want the email to be sent
#PBS -M <your_email_address>

Make sure you edit this to be your own email address (doesn’t have to be gmail)

$ emacs <filename>
$ emacs /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs

Page 25

Edit MapAndConvert.pbs, change RICHMOND

## Load the module containing bwa and samtools
module load bio-tools

## Map with BWA
bwa mem -t 4 /global/scratch/ARC_Training/PROCESS/RICHMOND/genome.fa /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R1.fastq /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R2.fastq > /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam

## Convert sam to bam using samtools view
samtools view -b /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam -o /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam

## Sort the bam file using samtools sort
samtools sort /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted

## Index the sorted bam
samtools index /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted.bam

Make sure you change all instances of RICHMOND to your own last name
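If you would rather not change every RICHMOND by hand in emacs, sed can do the substitution. This is a sketch, not part of the workshop script: personalize_pbs is a made-up name, it writes to a new file so the original stays untouched, and it assumes your last name contains no "/" or "&" characters (both are special to sed).

```shell
# Replace every RICHMOND in the script with your own last name,
# writing the result to a new file.
# Usage: personalize_pbs <in.pbs> <LASTNAME> <out.pbs>
personalize_pbs() {
    sed "s/RICHMOND/$2/g" "$1" > "$3"
    # Print how many lines still contain RICHMOND (0 when none remain);
    # "|| true" because grep exits non-zero when it finds nothing
    grep -c 'RICHMOND' "$3" || true
}
```

For example: personalize_pbs MapAndConvert.pbs SMITH MapAndConvert.SMITH.pbs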

Page 26

Now we can run our job in the queue

Submit job using qsub

$ qsub <file.pbs>
$ qsub /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs

Check job status using showq or qstat

$ qstat -u <username>
$ qstat -u richmonp

$ showq -u <username>
$ showq -u richmonp

Page 27

The output SAM file

Header lines:
@SQ - Sequence (contig/chromosome) from reference file
@PG - Program information about mapping
@RG - Read group information (we won’t have any here)

Tab delimited, each line is 1 read. Pairs will be next to each other in the file (e.g. Line1: Read1, Line2: Read2)

https://samtools.github.io/hts-specs/SAMv1.pdf
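Since a SAM file is plain tab-delimited text, the header and the reads can be pulled apart with grep and awk. The sketch below relies on two things from the SAM specification linked above: header lines start with "@", and column 2 is the FLAG, where the bit with value 4 means the read is unmapped. The function names are made up for this example.

```shell
# Print only the header lines (@SQ, @PG, ...) of a SAM file.
sam_header() {
    grep '^@' "$1"
}

# Count alignment records, and how many of them are unmapped.
# int(FLAG / 4) % 2 tests the "read unmapped" bit without bitwise operators.
sam_counts() {
    awk -F '\t' '!/^@/ { total++; if (int($2 / 4) % 2 == 1) unmapped++ }
        END { printf "records=%d unmapped=%d\n", total, unmapped }' "$1"
}
```

On the workshop output this would be: sam_header NA20845.chr19.subregion.sam and sam_counts NA20845.chr19.subregion.sam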

Page 28

A BAM file is a binary version of the SAM file

We cannot look at these binary files the same way as we look at text files

Downstream applications will almost always ask for a .bam file

Sorting is necessary for downstream applications

An index will be required for IGV
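Although head or more will not work on a BAM file directly, samtools view can turn it back into text on the fly. A small sketch, where peek_bam is a made-up name and the default of 20 lines is an arbitrary choice:

```shell
# Show the first lines of a BAM file as text.
# samtools view -h prints the header followed by the reads in SAM format.
# Usage: peek_bam <file.bam> [number_of_lines]
peek_bam() {
    samtools view -h "$1" | head -n "${2:-20}"
}
```

For example: peek_bam NA20845.chr19.subregion.sorted.bam 30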

Page 29

Data visualization

[Pipeline diagram repeated from the “Pipeline Overview” slide; this step: visualization with IGV]

Page 30

Use FileZilla to transfer files onto your own computer

Page 31

Open up IGV, and load the file we just created

Page 32

In the search box, type: chr19:1,201,956-1,242,206

[Screenshot labels: search box, zoom tool]
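IGV is happy with commas in coordinates, but command-line tools generally want them removed. The sketch below strips the commas and works out how many bases the region spans. The function names are made up, and it assumes a simple chr:start-end region with no "-" in the chromosome name.

```shell
# Strip the commas from an IGV-style region string.
plain_region() {
    echo "$1" | tr -d ','
}

# Length of a region in bases (start and end are both counted).
region_length() {
    coords=$(plain_region "$1")
    coords=${coords#*:}          # drop the "chr19:" part
    echo $(( ${coords#*-} - ${coords%-*} + 1 ))
}
```

For the region above, plain_region prints chr19:1201956-1242206 and region_length prints 40251. The plain form can then be passed to samtools view on an indexed BAM, for example: samtools view Sample.sorted.bam chr19:1201956-1242206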

Page 33

Congratulations!

In the remaining time, please try to repeat what we just did with these other raw data files in /global/scratch/ARC_Training/RAW_DATA/

Also, you can try to use different options with bwa mem:

-k 25
-B 10
-O 12,12

Visualize multiple samples at the same time in IGV
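To compare the bwa mem options listed above, one approach is to run the mapping once per option and keep each result in its own file. In the bwa mem usage statement, -k is the minimum seed length, -B the mismatch penalty and -O the gap open penalty; check the output of bwa mem on your own version to confirm. The function name and the output file names (.k25.sam and so on) are made up for this sketch.

```shell
# Run bwa mem once per option set, writing one SAM file per run
# so the results can be converted and compared side by side in IGV.
# Usage: try_bwa_options <genome.fa> <reads1.fastq> <reads2.fastq> <out_prefix>
try_bwa_options() {
    bwa mem -k 25 "$1" "$2" "$3" > "$4.k25.sam" &&
    bwa mem -B 10 "$1" "$2" "$3" > "$4.B10.sam" &&
    bwa mem -O 12,12 "$1" "$2" "$3" > "$4.O12.sam"
}
```

Each SAM file still needs the samtools view, sort and index steps before it can be loaded into IGV. On Orcinus these commands belong inside a .pbs script submitted with qsub rather than being run on the head node.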

Page 34

Thanks for participating!

We will contact you in a few days with the following:

1. A form for feedback on the course
2. The date of an “office hours” session in the next two weeks regarding this material
3. Information about how to get a full account on WestGrid for future analysis projects for those with temporary logins

Hope to see you all at the next workshop!