Lecture1: NGS Analysis on Beocat and an introduction to Perl programming for Bioinformatics 2014

Post on 08-May-2015


Files, directories, editing and pipes

NGS Analysis on Beocat and an introduction to Perl programming for Bioinformatics 2014

Jennifer Shelton

Before class

Please read through the following pages and install the software listed on them onto your laptop before coming to class:

https://github.com/i5K-KINBRE-script-share/FAQ/blob/master/UsingBeocat.md

https://github.com/i5K-KINBRE-script-share/FAQ/blob/master/BeocatEditingTransferingFiles.md

Logging in

• Use the program “ssh”, an OpenSSH SSH client (remote login program), to log into Beocat.

• You will not see text as you type your password.

$ ssh EID@beocat.cis.ksu.edu
password:

Terminal

• We are now connected to Beocat using a command-line interface (CLI). A CLI is an interface based on typing commands, usually at a read-eval-print loop (REPL).

• A read-eval-print loop (REPL) is a command-line interface that reads a command from the user, executes it, prints the result, and waits for another command.

• A graphical user interface (GUI), by contrast, is usually controlled by using a mouse.

Software carpentry v.5 http://software-carpentry.org/v5/gloss.html

Shell

• shell: A command-line interface such as Bash (the Bourne-Again Shell) or the Microsoft Windows DOS shell that allows a user to interact with the operating system.

[Figure: diagram with the labels “User” and “shell”]

Software carpentry v.5 http://software-carpentry.org/v5/gloss.html
Software carpentry v.4 http://software-carpentry.org/v4/shell

Shell

$ ps -p $$
PID TTY TIME CMD
63825 ttys002 0:00.04 -bash

“ps” is the “process status” program
“-p” is the PID parameter
“$$” is the current process
“-bash” is the name of the current shell
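The pieces of this command can also be tried one at a time. The following is a minimal sketch, assuming a ps that supports the POSIX -o option; the variable names are made up for illustration, and the fallback to "unknown" is only there so the sketch still runs on systems without ps.

```shell
# $$ expands to the process ID (PID) of the shell running this code.
shell_pid=$$
echo "PID of this shell: $shell_pid"

# -p selects one process by PID; "-o comm=" prints only its command
# name and suppresses the header line. Fall back if ps is unavailable.
shell_name=$(ps -p "$shell_pid" -o comm= 2>/dev/null || echo unknown)
echo "Name of this shell: $shell_name"
```

On a login shell the name may appear with a leading dash, as in the “-bash” shown on the slide.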

Shell

$ whoami
bioinfo

“whoami” program
“bioinfo” is the user ID

Files and directories

$ pwd
/homes/bioinfo

“pwd” or print working directory program

root/
    tmp
    homes
        user1
        bioinfo (current working directory)
        user2
    bin
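To experiment with pwd without touching real directories, the tree from the slide can be rebuilt in a scratch location. This is a sketch only: the sandbox directory created with mktemp is an assumption for illustration and is not part of Beocat.

```shell
# Build a throwaway copy of the tree from the slide (hypothetical sandbox).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/tmp" "$sandbox/bin" "$sandbox/homes/user1" \
         "$sandbox/homes/bioinfo" "$sandbox/homes/user2"

cd "$sandbox/homes/bioinfo"
here=$(pwd)            # absolute path of the current working directory
echo "$here"
echo "${here##*/}"     # last path component only: bioinfo
```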

Files and directories

$ ln -s /homes/bioinfo/pipeline_datasets/ ./
$ ls
pipeline_datasets@
$ ls pipeline_datasets/RNA-SeqAlign2Ref/
sample_read_list.txt*
Galaxy5-brain_2.fastq*
Galaxy4-brain_1.fastq*
Galaxy3-adrenal_2.fastq*
Galaxy2-adrenal_1.fastq*
Galaxy1-iGenomes_UCSC_hg19_chr19_gene_annotation.gtf*
hg19.fa*

“ln” or link program with the -s parameter for symbolic
“ls” list directory contents

[Figure: directory diagram with the labels pipeline_datasets, RNA-SeqAlign2Ref, AssembleT, notes.txt, and the files listed above]
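A symbolic link can be tried safely in a scratch directory. In this sketch the sandbox paths and the empty hg19.fa placeholder are assumptions for illustration; only the ln -s and ls calls mirror the slide.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/shared/pipeline_datasets" "$sandbox/home"
touch "$sandbox/shared/pipeline_datasets/hg19.fa"   # empty placeholder file

cd "$sandbox/home"
# -s makes a symbolic link: a small pointer to the target, not a copy.
ln -s "$sandbox/shared/pipeline_datasets" ./
ls -F                      # -F typically marks a symbolic link with a trailing @

# Listing through the link shows the contents of the original directory.
linked_files=$(ls pipeline_datasets/)
echo "$linked_files"       # hg19.fa
```

Removing the link later with rm pipeline_datasets deletes only the pointer; the target directory is left alone.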


Relative paths

$ ls /homes/bioinfo
$ ls ../../bin
ls ln rm mkdir…
$ ls ../bioinfo/bioinfo_software
cufflinks@ tophat@ samtools@…
$ ls ~/pipeline_datasets
Galaxy5-brain_2.fastq* Galaxy4-brain_1.fastq*…

root/
    tmp
    homes
        user1
        bioinfo
        user2
    bin

“ls” list directory contents
“..” one directory up from the current working directory
“.” current working directory
“~” home directory
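The shortcuts above can be checked against a scratch copy of the same tree. This is a sketch under assumptions: the sandbox and the file example_tool are hypothetical stand-ins, not real Beocat paths.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/bin" "$sandbox/tmp" "$sandbox/homes/user1" \
         "$sandbox/homes/bioinfo" "$sandbox/homes/user2"
touch "$sandbox/bin/example_tool"

cd "$sandbox/homes/bioinfo"
siblings=$(ls ..)          # .. is homes, one level up
tools=$(ls ../../bin)      # two levels up, then down into bin
same_dir=$(cd . && pwd)    # . is the current working directory itself
echo "$siblings"
echo "$tools"
echo ~                     # ~ expands to your home directory
```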


Navigate and create directories

$ cd ~/pipeline_datasets/RNA-SeqAlign2Ref
$ ls
sample_read_list.txt*
Galaxy5-brain_2.fastq*
Galaxy4-brain_1.fastq*
Galaxy3-adrenal_2.fastq*
Galaxy2-adrenal_1.fastq*
Galaxy1-iGenomes_UCSC_hg19_chr19_gene_annotation.gtf*
hg19.fa*
$ pwd
/homes/bioinfo/pipeline_datasets/RNA-SeqAlign2Ref
$ mkdir test
$ ls
test…

“cd” change directories
“mkdir” make directories

“touch” creates files
“rm” deletes files
“nano” is a command-line file editor
or use Cyberduck

Software carpentry v.5 http://software-carpentry.org/v5/gloss.html
Software carpentry v.4 http://software-carpentry.org/v4/shell
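A short round trip through these commands, as a sketch in a scratch directory. The sandbox and the name notes.txt are assumptions for illustration; rmdir is not covered on the slides and is included only to tidy up the empty directory.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir test                 # create a directory
touch test/notes.txt       # create an empty file (or update its timestamp)
ls test                    # notes.txt

rm test/notes.txt          # delete the file; there is no undo
rmdir test                 # remove the now-empty directory
```

Because rm does not move files to a trash folder, it is worth running ls on the same path first to confirm what will be deleted.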

Move files or directories

$ mv ~/pipeline_datasets/test.txt ~/test.txt
$ ls ~
test.txt…

“mv” move files or directories to a new location
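The same command also renames, since a rename is just a move within one directory. A minimal sketch, with a hypothetical sandbox and file names:

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/pipeline_datasets"
echo "hello" > "$sandbox/pipeline_datasets/test.txt"

# Move the file up one level; the original path no longer exists afterwards.
mv "$sandbox/pipeline_datasets/test.txt" "$sandbox/test.txt"

# mv also renames: same directory, new name.
mv "$sandbox/test.txt" "$sandbox/renamed.txt"
cat "$sandbox/renamed.txt"   # hello
```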

Unix wildcards and head/tail

$ ls ~/pipeline_datasets/RNA-SeqAlign2Ref/*.fastq
pipeline_datasets/RNA-SeqAlign2Ref/Galaxy5-brain_2.fastq*
pipeline_datasets/RNA-SeqAlign2Ref/Galaxy4-brain_1.fastq*
pipeline_datasets/RNA-SeqAlign2Ref/Galaxy3-adrenal_2.fastq*
pipeline_datasets/RNA-SeqAlign2Ref/Galaxy2-adrenal_1.fastq*
$ head ~/pipeline_datasets/RNA-SeqAlign2Ref/*.fastq
==> pipeline_datasets/RNA-SeqAlign2Ref/Galaxy2-adrenal_1.fastq <==
@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF…

“*” matches any string of zero or more characters (can be used with most basic Unix commands)
“head” prints the first lines of a file (10 by default; use -n to choose how many)
“tail” prints the last lines of a file (also 10 by default)
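The wildcard and the line counts can be checked on small made-up files. In this sketch the file names and contents are assumptions for illustration; the -n option is the standard way to ask head or tail for a specific number of lines.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > reads_1.fastq
printf '%s\n' a b c d > reads_2.fastq
touch notes.txt

matched=$(ls *.fastq)                 # the shell expands *.fastq before ls runs
echo "$matched"                       # notes.txt is not matched

first_four=$(head -n 4 reads_1.fastq) # one complete FASTQ-sized chunk
last_two=$(tail -n 2 reads_1.fastq)
default_count=$(head reads_1.fastq | wc -l)   # head prints 10 lines by default
echo "$first_four"
echo "$last_two"
```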

Common bioinformatics file formats

@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF

Fastq: sequence data with quality scores. Four lines per entry: header line, sequence, second header or +, base quality scores. http://en.wikipedia.org/wiki/FASTQ_format

>Locus_1_Transcript_2/3_Confidence_0.333_Length_600
CCCCCCTTCAGTTCCCTTAAAGCACAGCCCAGGGAAACCTCCTCACAGTTTTCATCCAGC
CACGGGCCAGCATGTCTGGGGGCAAATACGTAGACTCGGAGGGACATCTCTACACCGTTC
CCATCCGGGAACAGGGCAACATCTACAAGCCCAACAACAAGGCCATGGCAGACGAGC

Fasta: sequence data. Header line that begins with “>”, followed by the sequence (generally wrapped). http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
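Because each Fastq entry takes four lines, the number of reads can be estimated from the line count. This is a sketch that assumes the file is not line-wrapped (exactly four lines per entry); the two reads below are made up for illustration.

```shell
sandbox=$(mktemp -d)
cat > "$sandbox/reads.fastq" <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
>>II
EOF

total_lines=$(wc -l < "$sandbox/reads.fastq")
record_count=$((total_lines / 4))   # four lines per Fastq entry
echo "$record_count"                # 2
```

Counting lines is more dependable here than searching for “@”, because “@” and “>” are also valid quality-score characters, as the last line of the sample shows.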

Common bioinformatics file formats

HWUSI-EAS1794_0001_FC61KOJ:5:110:7624:5467#0 99 Locus_126_Transcript_1 6319 1 50M = 6478 209 GCTTGTGGCAT IIIIIIIIIIII
HWUSI-EAS1794_0001_FC61KOJ:5:110:7624:5467#0 147 Locus_126_Transcript_1 6478 1 50M = 6319 -209 GACGTTCGTGAT IHIIHHIIIIII

Sam: sequence alignment. Tab-delimited file with eleven required fields. http://samtools.github.io/hts-specs/SAMv1.pdf

Bam: binary version of a sam file.

Fields labeled in the example: read header, target header, MAPQ, read seq, read quality
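Since a Sam record is one tab-delimited line, ordinary text tools can pull out individual fields. This is a sketch, not a replacement for samtools: the read name read1 is hypothetical, the values are adapted from the example above, and the sequence is shortened for display, so the line would not pass a strict validator.

```shell
# One alignment line with the 11 required tab-separated fields.
sam_line=$(printf 'read1\t99\tLocus_126_Transcript_1\t6319\t1\t50M\t=\t6478\t209\tGCTTGTGGCAT\tIIIIIIIIIII')

# Field 1 = read header, 3 = target header, 4 = position, 5 = MAPQ.
summary=$(printf '%s\n' "$sam_line" | awk -F '\t' '{ print $1, $3, $4, $5 }')
field_count=$(printf '%s\n' "$sam_line" | awk -F '\t' '{ print NF }')
echo "$summary"        # read1 Locus_126_Transcript_1 6319 1
echo "$field_count"    # 11
```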

Pipes

[Figure: standard input (stdin)]

“|” passes output from some kinds of programs as input to other programs to chain together steps
“>” tells the shell to print the output to a file rather than display it on the screen

Software carpentry v.4 http://software-carpentry.org/v4/shell

Pipes

$ cd ~/pipeline_datasets/RNA-SeqAlign2Ref
$ wc -l *.fastq > lines

[Figure: wc → lines]

$ wc -l *.fastq | sort > lines

[Figure: wc → sort → lines]

$ wc -l *.fastq | sort | head -1 > lines

[Figure: wc → sort → head -1 → lines]

Software carpentry v.4 http://software-carpentry.org/v4/shell
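The three-stage pipeline can be rehearsed on small made-up files before running it on real reads. This sketch makes two substitutions that are my own rather than the slide's: sort -n, which compares the counts as numbers instead of as text, and head -n 1, the standard spelling of head -1. The file names are hypothetical.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 1 2 3 4 5 6 7 8 > big.fastq
printf '%s\n' 1 2 3 4 > small.fastq

# wc counts lines per file, sort -n orders the counts numerically,
# head -n 1 keeps the smallest, and > writes that line to the file "lines".
wc -l *.fastq | sort -n | head -n 1 > lines
cat lines
```

Nothing is printed by the pipeline itself because > sends the final output to the file; cat then shows the single line that was kept.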

Pipes and grep

This programming model is called pipes and filters.

A filter transforms a stream of input into a stream of output.

A pipe connects two filters.

Any program that reads lines of text from standard input, and writes lines of text to standard output, can work with every other such program.

$ wc -l *.fastq | sort | head -1 > lines
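To make the idea of a filter concrete, here is a sketch of a home-made one written as a shell function. The function name to_upper and the sample input are hypothetical; the point is only that anything reading standard input and writing standard output can sit between two pipes.

```shell
# A hypothetical filter: reads lines on standard input and writes the
# upper-cased lines to standard output.
to_upper() {
    tr '[:lower:]' '[:upper:]'
}

# Because it only uses stdin and stdout, it slots in anywhere in a pipeline.
result=$(printf 'ttga\nacgt\n' | to_upper | sort | head -n 1)
echo "$result"    # ACGT
```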

Pipes and grep

“|” passes output from some kinds of programs as input to other programs to chain together steps
“-” tells the samtools program to use the output from the previous step as input
“>” tells the shell to print the output to a file rather than display it on the screen
“grep” searches for patterns in a file. The “-c” parameter tells grep to count lines with the pattern (in this case we can count contigs in a fasta).

$ cd ~/pipeline_datasets/sam_bam
$ /homes/bioinfo/bioinfo_software/samtools/samtools cat brain_rep_1_tophat2_out/accepted_hits.bam adrenal_rep_1_tophat2_out_1/accepted_hits.bam | /homes/bioinfo/bioinfo_software/samtools/samtools flagstat - > alignment_stats.txt
$ grep -c ">" ../RNA-SeqAlign2Ref/hg19.fa
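The grep count and the “-” convention can both be tried without samtools. This sketch uses a made-up two-contig fasta and uses cat as a stand-in for a program that accepts “-”; the ^ anchor is my addition so that only lines starting with “>” are counted.

```shell
sandbox=$(mktemp -d)
cat > "$sandbox/contigs.fa" <<'EOF'
>contig_1
ACGTACGT
>contig_2
TTGACCAA
EOF

# -c prints the number of matching lines instead of the lines themselves.
contig_count=$(grep -c '^>' "$sandbox/contigs.fa")
echo "$contig_count"    # 2

# Many tools, cat included, read standard input when given "-" as a file name.
combined=$(printf '>contig_3\nGGGG\n' | cat "$sandbox/contigs.fa" - | grep -c '^>')
echo "$combined"        # 3
```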

Pipes with samtools

$ /homes/bioinfo/bioinfo_software/samtools/samtools

https://www.biostars.org/p/43677/
http://samtools.sourceforge.net/pipe.shtml

Review Unix

“ps -p $$” process status for the process ID of the current shell
“pwd” print working directory
“ln -s” create link with the -s parameter for symbolic
“ls” list directory contents
“..” one directory up from the current working directory
“.” current working directory
“~” home directory
“*” wildcard
“cd” change directories
“mkdir” make directories
“mv” moves files or directories
“head” prints the first lines of a file (10 by default)
“tail” prints the last lines of a file (10 by default)
“|” chains programs together
“grep” searches for patterns
“wget” non-interactive network downloader

Review NGS

samtools cat concatenate BAMs

samtools flagstat simple stats

samtools view SAM<->BAM conversion

samtools sort Sort alignments by leftmost coordinates

samtools rmdup Remove potential PCR duplicates