Post on 20-Mar-2020
Tutorial 1: An Introduc1on to Linux for Next-‐Gen
DNA Sequencing Data Analysis: A Hands-‐on Tutorial
Dr Stratos Efstathiadis efstra1os.efstathiadis@nyumc.org
Technical Director High Performance Compu1ng Facility
Center for Health Informa1cs and Bioinforma1cs NYULMC
This is the First Part in a Series of tutorials in High Performance Computing and Sequencing Informatics
• This tutorial will cover: – Running basic Linux commands
• Logging-in, Files and Directories, File Editing, Running command line tools.
– How to map millions of Next-Gen DNA reads onto a Genome – A complete Next-Gen Seq. Read Alignment Example – Using the Integrative Genome Viewer (IGV)
• This tutorial will Not cover: – Sequencing technologies or alignment algorithms in detail. – Advanced Linux concepts.
• File and Directory permissions, Environment variables, Shell scripts, batch job submission, etc. will be covered in another tutorial.
Bioinformatics Requires Powerful Computers
• One definition of bioinformatics is: “The use of computers to analyze biological problems.”
• As biological data sets have grown larger and biological problems have become more complex, the requirements for computing power have also grown.
• Computers that can provide this power generally use the Unix operating system - so you must learn Unix
Will computers crash Genomics? SCIENCE, 11 February 2011, pg 666
Remote Login using Secure Shell (ssh)
• Secure Shell: A set of tools that allow secure interaction with remote servers (ssh, sshd, ssh-add, ssh-keygen, ssh-agent, etc.) using two-factor authentication, based on – What you know (a pass phrase) – What you have (a key)
– ssh is the de-facto remote login mechanism in Linux/Unix. rlogin, rsh, telnet, etc. use insecure protocols (transmit clear
text passwords) and are not used (actually, blocked).
How to login remotely on a Linux server using Secure Shell (ssh)
Username: your NYULMC Kerberos id Hostname: ec2-‐aa-‐ddd-‐cc-‐nn.compute-‐1.amazonaws.com
• Troubleshoo1ng: ssh –vvvv username@hostname
• The ssh-‐agent stores the key in memory (of your laptop/desktop).
$ ssh username@ec2-aa-ddd-cc-nn.compute-1.amazonaws.com
First Linux Commands -bash-3.2$ date Wed Jan 2 13:28:49 EST 2013
-bash-3.2$ pwd (Present/current Working Directory) /home/username
-bash-3.2$ ls (list files in the pwd)
-bash-3.2$ touch myfile (creates an empty file)
Linux file names and commands are Case Sensi?ve: myfile is different than MyFile
-bash-3.2$ LS -‐bash: LS: command not found
Home directory
Working with Directories
• Directories are a means of organizing your files on a Unix computer. – They are equivalent to folders on Windows and Macintosh computers
• Directories contain files, executable programs, and sub-‐directories
• Understanding how to use directories is crucial to manipula1ng your files on a Unix system.
The Tree structure of the file system
Your Home Directory
• When you login to the server, you always start in your Home directory.
• Create sub-‐directories to store specific projects or groups of informa1on, just as you would place folders in a filing cabinet.
• Do not accumulate thousands of files with cryp1c names in your Home directory
Shortcuts
• There are some important shortcuts in Unix for specifying directories
• . (dot) means "the current directory"
• .. means "the parent directory" -‐ the directory one level above the current directory, so cd .. will move you up one level
• ~ (1lde) means your Home directory, so cd ~ will move you back to your Home.
– Just typing a plain cd will also bring you back to your home directory
File & Directory Commands • This is a minimal list of Unix commands that you must know for file management:
ls (list) mkdir (make directory)
cd (change directory) rmdir (remove directory)
cp (copy) pwd (present working directory)
mv (move) more (view by page)
rm (remove) cat (view entire file on screen)
• All of these commands can be modified with many op1ons. Learn to use Unix ‘man’ pages for more informa1on.
Copy & Move • cp lets you copy a file from any directory to any other directory, or create a copy of a file with a new name in one directory
• cp filename.ext newfilename.ext• cp filename.ext subdir/newname.ext• cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv allows you to move files to other directories, but it is also used to rename files. – Filename and directory syntax for mv is exactly the same as for the cp command.
• mv filename.ext subdir/newfilename.ext
– NOTE: When you use mv to move a file into another directory, the current file is deleted.
Delete • Use the command rm (remove) to delete files
• There is no way to undo this command!!! – We have set the server to ask if you really want to remove each file before it is deleted.
– You must answer “Y” or else the file is not deleted.
> ls af151074.gb_pr5 test.seq > rm test.seq rm: remove test.seq? y > ls af151074.gb_pr5
Exercise: Working with Files and Directories
-bash-3.2$ mkdir project1 (Make Directory) -bash-3.2$ cd project1 (Change Directory) -bash-3.2$ pwd /home/efstae01/project1 -bash-3.2$ cp /data/tutorial/ChIPseq_chr19.fastq . -bash-3.2$ ls –la
-bash-3.2$ head ChIPseq_chr19.fastq @HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79## @HWUSI-EAS610_0001:3:1:5:1490#0/1 GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG + B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? @HWUSI-EAS610_0001:3:1:6:388#0/1 CAGAGTTCATGAAATAGGCCTCTAGTCTTCCTAGAC
FASTQ Files
@HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
36 bp read
• The de-facto file format for sharing sequence read data • Sequence and a per-base quality score • FASTQ Files are Text files • 4 Lines per read -bash-3.2$ wc –l ChIPseq_chr19.fastq 1082524 ChIPseq_chr19.fastq
36 Quality scores
Star?ng emacs
• To start Emacs, at the > command prompt, just type: emacs
• To use Emacs to edit a file, type: emacs filename
(where filename is the name of your file)
• When Emacs is launched, it opens either a blank text window or a window containing the text of an exis1ng file.
Save & Exit • To save a file as you are working on it, type: Ctrl-‐x » Ctrl-‐s
• To exit emacs and return to the Unix shell, type: Ctrl -‐x » Ctrl -‐c
If you have made any changes to the file, Emacs will ask you if you want to save:
Save file /u/browns02/nrdc.msf? (y,n,!,.,q,C-r or C-h)• Type “y” to save your changes and exit • If you type “n”, then it will ask again:
Modified buffers exist; exit anyway? (yes or no)• If you answer “no”, then it will return you to the file, you must answer “yes” to exit without saving changes
Copying data using Secure Copy (scp)
Exercise: Login on the tutorial Amazon Instance -‐bash-‐3.2$ cd project1 -‐bash-‐3.2$ fastqc ChIPseq_chr19.fastq -‐bash-‐3.2$ ls –l Copy the file ChIPseq_chr19.fastq.zip to your laptop using scp:
Laptop> scp user@ec2-‐54-‐242-‐89-‐16.compute-‐1.amazonaws.com:~/project1/ChIPseq_chr19_fastqc.zip .
Unzip it: Laptop> unzip ChIPseq_chr19_fastqc.zip
Open with your browser fastqc-‐report.html
Short Read Alignment • Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists – Approximate answer to: where in genome did the read originate?
…TGATCATA… GATCAA
…TGATCATA… GAGAAT
beqer than
• What is “good”? For now, we concentrate on:
…TGATATTA… GATcaT
…TGATcaTA… GTACAT
beqer than
– Fewer mismatches is better – Failing to align a low-quality
base is better than failing to align a high-quality base
Short Read Applica1ons
Finding the alignments is typically the performance boqleneck
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… GCGCCCTA
GCCCTATCG GCCCTATCG
CCTATCGGA CTATCGGAAA
AAATTTGC AAATTTGC
TTTGCGGT TTGCGGTA
GCGGTATA
GTATAC…
TCGGAAATT CGGAAATTT
CGGTATAC
TAGGCTATA
GCCCTATCG GCCCTATCG
CCTATCGGA CTATCGGAAA
AAATTTGC AAATTTGC
TTTGCGGT
TCGGAAATT CGGAAATTT CGGAAATTT
AGGCTATAT AGGCTATAT AGGCTATAT GGCTATATG
CTATATGCG
…CC …CC …CCA …CCA …CCAT
ATAC… C… C…
…CCAT …CCATAG TATGCGCCC
GGTATAC… CGGTATAC
GGAAATTTG
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… ATAC… …CC
GAAATTTGC
Indexing
• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
• Choice of index is key to performance
Suffix tree Suffix array Seed hash tables Many variants, incl. spaced seeds
• Invented by David Wheeler in 1983 (Bell Labs). Published in 1994. “A Block Sor@ng Lossless Data Compression Algorithm” Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment
Corpora@on, Burrows M, Wheeler DJ. 1994
• Originally developed for compressing large files (bzip2, etc.)
• Lossless, Fully Reversible
• Alignment Tools based on BWT: bow@e, BWA, SOAP2, etc.
• Approach: – Align reads on the transformed reference genome, using an efficient index (FM index) – Solve the simple problem first (align one character) and then build on that solu1on to solve a
slightly harder problem (two characters) etc.
• Results in great speed and efficiency gains (a few GigaByte of RAM for the en1re H. Genome). Other approaches require tens of GigaBytes of memory and are much slower.
NGS Read Alignment Burrows Wheeler Transforma1on (BWT)
Generating the SAM file
-‐bash-‐3.2$ cd project1 -‐bash-‐3.2$ bwa aln /data/tutorial/mm9.fasta ChIPseq_chr19.fastq >
ChIPseq_chr19.sai bwa_aln] 17bp reads: max_diff = 2 [bwa_aln] 38bp reads: max_diff = 3 [bwa_aln] 64bp reads: max_diff = 4 [bwa_aln] 93bp reads: max_diff = 5 [bwa_aln] 124bp reads: max_diff = 6 [bwa_aln] 157bp reads: max_diff = 7 [bwa_aln] 190bp reads: max_diff = 8 [bwa_aln] 225bp reads: max_diff = 9 [bwa_aln_core] calculate SA coordinate... 25.94 sec [bwa_aln_core] write to the disk... 0.03 sec [bwa_aln_core] 262144 sequences have been processed. [bwa_aln_core] calculate SA coordinate... 0.86 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 270631 sequences have been processed. [main] Version: 0.6.2-‐r126 [main] CMD: bwa aln /data/tutorial/mm9.fasta ChIPseq_chr19.fastq [main] Real 1me: 28.230 sec; CPU: 28.229 sec
-‐bash-‐3.2$ bwa samse /data/tutorial/mm9.fasta ChIPseq_chr19.sai ChIPseq_chr19.fastq > ChIPseq_chr19.sam
[bwa_aln_core] convert to sequence coordinate... 3.39 sec [bwa_aln_core] refine gapped alignments... 0.43 sec [bwa_aln_core] print alignments... 0.50 sec [bwa_aln_core] 262144 sequences have been processed. [bwa_aln_core] convert to sequence coordinate... 1.75 sec [bwa_aln_core] refine gapped alignments... 0.31 sec [bwa_aln_core] print alignments... 0.01 sec [bwa_aln_core] 270631 sequences have been processed. [main] Version: 0.6.2-‐r126 [main] CMD: bwa samse /data/tutorial/mm9.fasta ChIPseq_chr19.sai ChIPseq_chr19.fastq [main] Real 1me: 6.710 sec; CPU: 6.711 sec
more • Use the command more to view at the contents of a file one screen at a 1me:
> -bash-3.2$ more ChIPseq_chr19.sam @SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856@SQ SN:chr12 LN:121257530@SQ SN:chr13 LN:120284312@SQ SN:chr14 LN:125194864@SQ SN:chr15 LN:103494974@SQ SN:chr16 LN:98319150@SQ SN:chr17 LN:95272651@SQ SN:chr18 LN:90772031@SQ SN:chr19 LN:61342430@SQ SN:chr1LN:197195432@SQ SN:chr2LN:181748087@SQ SN:chr3LN:159599783@SQ SN:chr4LN:155630120@SQ SN:chr5LN:152537259@SQ SN:chr6LN:149517037@SQ SN:chr7LN:152524553@SQ SN:chr8LN:131738871@SQ SN:chr9LN:124076172@SQ SN:chrMLN:16299@SQ SN:chrXLN:166650296@SQ SN:chrYLN:15902555HWUSI-EAS610_0001:3:1:4:1405#0 16 chr1960874227 37 36M * 0 0
CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9T26HWUSI-EAS610_0001:3:1:5:1490#0 0 chr1932960373 37 36M * 0 0
GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36HWUSI-EAS610_0001:3:1:6:388#016 chr1918177553 37 36M * 0 0
GTCTAGGAAGACTAGAGGCCTATTTCATGAACTCTG @AABAAB@@B@?BBBABABA;?<CBBBBBA>C;BBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36HWUSI-EAS610_0001:3:1:7:1045#0 16 chr1960698168 37 36M * 0 0
ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9AB87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36Hit the spacebar to page down through the file
– Ctrl-‐U moves back up a page – At the boqom of the screen, more shows how much of the file has been displayed
hqp://samtools.sourceforge.net/SAM1.pdf
SAM/BAM format
V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1 V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1 V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2 V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1
Query Name Ref sequence
posi1on of alignment
query sequence (same strand as ref) query quality
@HD VN:1.0 @SQ SN:chr20 AS:HG18 LN:62435964 @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891 @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
Header sec1on
Alignment sec1on
Converting the SAM file to a BAM file
BAM file B: Binary format vs Text format (SAM), resul1ng in more efficient data storage. Sequencing plazorm independent.
-‐bash-3.2$ samtools view -bt /data/tutorial/mm9.fasta -o ChIPseq_chr19.bam ChIPseq_chr19.sam
[samopen] SAM header is present: 22 sequences.
-bash-3.2$ samtools sort ChIPseq_chr19.bam ChIPseq_chr19.sorted
-bash-3.2$ samtools index ChIPseq_chr19.sorted.bam
-bash-3.2$ samtools view in.bam chr2:20,100,000-20,200,000 > out.bam
-bash-3.2$ samtools view lib15103.sorted.rmdup.bam "gi|121635883|ref|NC_008769.1|:2,100,000-2,200,000"
BAM Files: Using the bitwise FLAG -f : bit set -F: bit is not set
Counting Alignments: -bash-3.2$ samtools view -c ChIPseq_chr19.sorted.bam 270631
Count unmapped reads –f 4 (the bitwise flag 0x004 is set) -bash-3.2$ samtools view -c -f 4 ChIPseq_chr19.sorted.bam 1839
Count the reads that are not unmapped (hence, count mapped alignments) –F 4 (the bitwise flag 0x0004 is Not set)
-bash-3.2$ samtools view -c -F 4 ChIPseq_chr19.sorted.bam 268792
-bash-3.2$ samtools flagstat ChIPseq_chr19.sorted.bam
Putting it all together in a shell script (use emacs) /data/tutorial/first
REFERENCE=/data/tutorial/mm9.fasta
FASTQ=ChIPseq_chr19.fastq SAI=ChIPseq_chr19.sai
SAM=ChIPseq_chr19.sam BAM=ChIPseq_chr19.bam
SORTED=ChIPseq_chr19.sorted
echo $REFERENCE; echo $FASTQ; echo $SAI
bwa aln $REFERENCE $FASTQ > $SAI bwa samse $REFERENCE $SAI $FASTQ > $SAM
samtools view -‐bt $REFERENCE -‐o $BAM $SAM samtools sort $BAM $SORTED
The screen command
• Run “screen” before you run your command • Detach by pressing: Control-‐a d • Logout
• Login again • Re-‐aOach by running the command: “screen –r”
• You may have several screen sessions acRve
Using IGV to view alignments
The Integra1ve Genome Viewer can be installed on your laptop
wget hqp://www.broadins1tute.org/igv/projects/downloads/IGV_2.1.22.zip
unzip IGV_2.1.22.zip
cd IGV_2.1.22
java -‐Xmx2g -‐jar igv.jar
You will need to transfer to your laptop: • Indexed Reference FASTA file • Reference Genome Feature File (GFF) • Sorted and indexed mapped read BAM files
Using IGV to view alignments