UNIX and Perl

Lecture 2

Matt Hudson

Review

• Unix is text based:

doesn’t waste computer resources on graphics

allows you to write and use scripts easily

makes remote access easy

don’t have to learn “where everything is”

gives the user more power

Review

• When negotiating file systems, it is important to remember the directory structure and the commands cd, ls and pwd.

• You must be very wary of creating multiple files with the same name, as it is easy to over-write an existing, important file

• There is no undelete or trash basket in UNIX – delete or overwrite a file and it is gone

Review

• Edit text files with nano. Standard format for bioinformatics text-based applications is called fasta:

>sequence id|more info|yet more info

ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG

CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT

CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC

CCACTAGCTGCATCGATG

Name, or ID, of sequence

Sequence itself, in one or many lines

Review

• The blastall command takes the following obligatory arguments

-p <program name, e.g. blastn, blastp>

-i <input file name>

And these arguments are also very important

-d <database name>

-e <E-value cutoff>

-M <matrix name>

Using your UNIX skills

• Now we’re going to use our new UNIX skills in anger with some more sophisticated bioinformatics programs.

• It’s an idea to make an “experiment” or “scratch” folder

• Use this as a place for stuff that might explode….

Bioinformatics programs

• blastall: blastp, blastn, blastx, bl2seq… etc

• HMMER: hmmpfam, hmmsearch, hmmbuild

• clustalw

• fasta34

The background

• If you are running a program that takes a long time, especially if redirecting output to a file, put it in the background, and you can keep working.

Either put & at the end of the command

Or stop it (Ctrl-Z), then type bg % to resume it in the background.
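As a minimal sketch of the & approach, the lines below run a stand-in for a slow program in the background while the shell stays free. The sleep command and the file name slow_output.txt are placeholders, not part of the course material.

```shell
# Start a slow command in the background, sending its output to a file.
# "sleep 1; echo ..." is a stand-in for a long-running program.
( sleep 1; echo "search finished" ) > slow_output.txt &

# $! holds the process ID of the most recent background job.
job_pid=$!
echo "Started background job with PID $job_pid"

# The shell is free for other work while the job runs.
echo "Still working in the foreground..."

# wait blocks until the background job has finished.
wait "$job_pid"
result=$(cat slow_output.txt)
echo "$result"
```

In an interactive shell you would normally skip the wait and check on the job later with the jobs command.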

Running overnight

• If you are running a program overnight, use qsub from the head node to control the program command – this way the command will keep running when you exit the shell.

• Don’t forget to redirect output to a file when doing this (you can still get the output otherwise, but it can be hard to find).

Viewing running processes

• You can see all the processes on the system, ranked by how much memory and CPU time they are using.

[user@server ~]$ top

Fasta format

• The standard format for nucleotide and protein sequence is fasta, named after the program. It is very easy to read and write manually or with a program:

>sequence id|more info|yet more info

ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG

CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT

CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC

CCACTAGCTGCATCGATG

Name of sequence

Sequence itself, in one or many lines

Multiple fasta format

>sequence id|more info|yet more info

ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG

CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT

CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC

CCACTAGCTGCATCGATG

>sequence 2 id|more info|yet more info

ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG

CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT

CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC

CCACTAGCTGCATCGATG

>sequence 3 id|more info|yet more info

ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG

CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT

CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC

CCACTAGCTGCATCGATG

Multiple fasta format

• Multiple sequence formats are essential for doing batch or high-throughput work

• Most bioinformatics programs accept multiple sequences in this format

• Some websites still do, but most have stopped accepting multiple sequences because people were using too many resources.
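Because each record starts with a single ">" line, a multiple fasta file is easy to take apart with standard tools. The sketch below counts the records and reports each record's total length. The file demo_multi.fasta and its three short sequences are made up for illustration.

```shell
# Write a small multiple-fasta file (made-up sequences, for illustration only).
cat > demo_multi.fasta <<'EOF'
>seq1|example
ACCCGTGAGT
AGTGT
>seq2|example
CAGATG
>seq3|example
CCACTAGCTGCATCGATG
EOF

# Count records: each record has exactly one header line starting with ">".
n_records=$(grep -c '^>' demo_multi.fasta)
echo "records: $n_records"

# Report each record's ID and total sequence length, summing over wrapped lines.
lengths=$(awk '
    /^>/ { if (id != "") print id, len; id = substr($0, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id, len }
' demo_multi.fasta)
echo "$lengths"
```

The awk part prints a record only when the next header (or the end of the file) is reached, which is why sequences wrapped over several lines are still counted as one.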

DNA sequence output from ABI 377 (a gel-based sequencer)

1. Trace files (dye signals) are analyzed and bases called to create chromatograms.

2. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.

Quality and phred

• When manually interpreting Sanger sequence, you interpret the quality of the base intuitively.

• Phil Green’s program “phred” made genome sequencing possible by doing this mathematically.

Fasta + quality

• This is the standard output of Sanger platforms: two files

• It’s great and easy to read, but takes up a lot of disk space (at least 4 bytes / base)

>sequence id|more info|yet more info

ACCCGTGA

>sequence id|more info|yet more info

9 13 20 24 26 30 29 30

fastq

• Much less fun to read, but only two bytes per base and only one file.

• Encoding varies. Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC

+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
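Each quality character is a Phred score stored as an ASCII code plus a fixed offset. The sketch below decodes the last six quality characters of the example read. It assumes an offset of 33 (the Sanger-style encoding), because the character "9" is ASCII 57 and so falls below the 64 to 126 range described for Illumina 1.3+. The function name phred is made up for this example.

```shell
# Last six quality characters of the example read.
qual_tail='9IG9IC'

# ASSUMPTION: Phred+33 offset. "9" is ASCII 57, which is below 64,
# so this read cannot be using the Illumina 1.3+ (Phred+64) encoding.
offset=33

# Decode one quality character into a Phred score.
# printf '%d' "'X" prints the ASCII code of the character X.
phred() {
    code=$(printf '%d' "'$1")
    echo $((code - offset))
}

score_I=$(phred I)
score_9=$(phred 9)
echo "I -> $score_I"
echo "9 -> $score_9"

# Decode each character in turn, one per line.
for ch in $(printf '%s' "$qual_tail" | fold -w 1); do
    echo "$ch $(phred "$ch")"
done
```

For data that really is in Illumina 1.3+ format, the same sketch would apply with offset=64.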

sam and bam

• These formats originated as ways to store short reads aligned to a reference genome

• Now a lot of Illumina raw data also comes as bam

• The sam format is tab-delimited text that can be edited in Perl, but bam is binary and difficult to handle in Perl.

• The samtools utility (installed on biocluster) can convert between sam, bam, fastq and some other formats.
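Because sam is tab-delimited text, ordinary UNIX tools can pick columns out of it, not only Perl. The sketch below uses cut on three made-up alignment lines. The file name demo.sam and the reads are illustrative only, and a real sam file would also carry header lines starting with "@".

```shell
# Build a tiny SAM-like file. These three alignment lines are made up for
# illustration; columns follow the SAM order
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
printf 'read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACCCGTGA\tIIIIIIII\n'  > demo.sam
printf 'read2\t16\tchr1\t250\t60\t8M\t*\t0\t0\tCAGATGTC\tIIIIIIII\n' >> demo.sam
printf 'read3\t0\tchr2\t75\t30\t8M\t*\t0\t0\tCCACTAGC\tIIIIIIII\n'  >> demo.sam

# Column 3 is the reference name; count how many reads hit each reference.
per_ref=$(cut -f 3 demo.sam | sort | uniq -c | awk '{ print $2, $1 }')
echo "$per_ref"

# Columns 1 and 4: read name and leftmost mapping position.
positions=$(cut -f 1,4 demo.sam)
echo "$positions"
```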

The BLAST command

• The blast command is blastall

• You need to tell it what program to use, what database, and what input file. Many other options are available.

• e.g.

[user@server ~]$ blastall -p myprogram -i myfile.txt -d mydatabase

genomics programs

• Blast: formatdb, blastall (blastp, blastn, blastx, bl2seq… etc)

• novoalign: novoindex, novoalign

• bowtie: bowtie-build, bowtie

Noticing a pattern?

Index files

• All of these programs generate an “index” file, which speeds up searching of the query (short) against the “database” (big)

Making your own database

• You can use formatdb to make your own BLAST database

[user@server ~]$ wget ftp://ftp.ncbi.nih.gov/blast/db/swissprot.tar.gz

[user@server ~]$ tar -xzvf swissprot.tar.gz

[user@server ~]$ formatdb -i swissprot -p T

[user@server ~]$ blastall -p blastp -i exampleprotein.txt -d swissprot

Hidden Markov models

• The HMMER package:

hmmpfam Search for matches to the pfam database

hmmbuild Make your own model

hmmsearch Search a protein file with your model

Let’s try it. But not all at once – very compute intensive.

[user@server ~]$ nice -n 10 hmmpfam /home/bio/db/pfam testprotein.txt

clustalw

• The most commonly used alignment program.

• Try aligning the proteins in “testprotein.txt”

• You can use the alignment for phylogenetic analysis, or to create a hidden Markov model.

Making a HMM

• Using the .aln file from your clustalw output:

[user@server ~]$ hmmbuild myhmm testprotein.aln

This creates a model of your alignment that you can use to search for sequences that belong in it.

Searching with the HMM

[user@server ~]$ hmmsearch myhmm testprotein.txt

• In this example we’re searching against the file we used to make the hmm.

• But you could search against the whole of genbank or swissprot, or against a whole genome, to find proteins with structural similarity to yours.

Fasta

• The fasta34 program is also installed on mrmarsh

• Fasta34 is an alternative to BLAST that is slower, but provides more accurate output, and can use any fasta format file as a database.

• Try searching exampleprotein.txt against testprotein.txt

[user@server ~]$ fasta34

Getting information from output files

• Often these are huge text files

• grep is a great tool for getting at the nitty-gritty.

• awk is more powerful, but mostly involves writing scripts, and has been largely superseded by Perl.

grep

• My favorite of all UNIX commands

• “global regular expression print”

• Allows you to pick out lines of a text file that match a query, count them, and retrieve lines around the match.

grep - continued

grep 'Query=' myblast.txt

What sequences did I BLAST?

grep -c '>' testprotein.txt

How many sequences are in this file?

grep -A 10 '>' testprotein.txt

Give me each protein’s header line plus the ten lines after it
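The counting and context options can be tried safely on a small file first. This sketch uses a made-up two-protein file, demo_protein.txt, as a stand-in for testprotein.txt.

```shell
# A tiny protein fasta file (made-up sequences, for illustration only).
cat > demo_protein.txt <<'EOF'
>prot1 made-up example
MKTAYIAKQR
QISFVKSHFS
>prot2 made-up example
MADEEKLPPG
EOF

# -c counts matching lines; anchoring with ^ only matches ">" at the start of a line.
n_seqs=$(grep -c '^>' demo_protein.txt)
echo "sequences: $n_seqs"

# -A 1 prints each matching line plus the one line after it.
# grep prints a "--" line between groups that are not adjacent in the file.
first_lines=$(grep -A 1 '^>' demo_protein.txt)
echo "$first_lines"
```

Note that the third line of the file, QISFVKSHFS, is not printed, because it lies more than one line after its header.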

Getting files from remote servers

• Before there was the world wide web, there was ftp.

• Note that WORKER NODES ARE NOT CONNECTED TO THE INTERNET, so do network stuff from the head node.

[user@server ~]$ ftp ftp.ncbi.nih.gov

ftp commands

• open: open a connection

• ls: same as UNIX

• cd: same as UNIX

• get: get me this file

• mget: get more than one file

• put: put a file on the server

• lcd: local cd

• close: close connection

• bye: exit the ftp program

Secure ftp

• Although NCBI allows you to connect using ftp, this is because they have only public files, and they don’t let you upload anything.

• Most UNIX computers disallow ftp logins. However, if you can ssh to a computer, you can also use sftp. The commands are identical to ftp, but you can access your own files securely.

wget

• But what if you want to get a file which is available for download from a website, but not by ftp?

• wget will get the contents of any URL and put them in a file.

[user@server ~]$ wget www.cnn.com

How do I write a script, then?

[user@server ~]$ nano myscript.pl

So what’s a program look like?

All programming courses traditionally start with a program that prints “Hello, world!”. So in keeping with that tradition:

#!/usr/bin/perl

print "Hello, world\n";

Note:

No line numbers.

Each statement ends with a semicolon.

Exit and run

Save with Control-O, then exit nano with Control-X.

[user@server ~]$ perl myscript.pl

Or, if you’re feeling fancy

[user@server ~]$ chmod 755 myscript.pl

[user@server ~]$ ./myscript.pl

This makes the file EXECUTABLE.

Those numbers after chmod are octal numbers. But don’t worry too much about that.
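For the curious, here is a short sketch of what 755 means, using a throwaway shell script rather than the Perl script. The file name demo_script.sh is made up for this example.

```shell
# Create a small script in a scratch file.
printf '#!/bin/sh\necho "Hello, world"\n' > demo_script.sh

# 755 is three octal digits: owner, group, everyone else.
# Each digit is a sum of read (4) + write (2) + execute (1):
#   7 = 4+2+1 = rwx,  5 = 4+1 = r-x
chmod 755 demo_script.sh

# The first ten characters of ls -l show the resulting permissions.
perms=$(ls -l demo_script.sh | cut -c 1-10)
echo "$perms"

# Now the file can be run directly.
output=$(./demo_script.sh)
echo "$output"
```

So 755 lets the owner read, write and run the file, while everyone else can read and run it but not change it.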

GNU

• GNU stands for “GNU’s Not Unix”

• This is what computer people think is a joke

• The reason for GNU is that the name “UNIX” was owned by private companies. GNU exists to make free software for UNIX, but couldn’t use the name

• Linux grew out of GNU

• Not only is GNU software free, but the way in which the software was made (the “source code”) is also made public and easily downloaded.

Downloading programs

• “ready to run” programs are called binaries in unix-speak.

• They are often “zipped” in a .tar.gz file.

• To unzip, use gunzip and tar -xvf

• To run, specify the path to the program.

E.g., ./program or /home/matt/bin/program

• You can download programs for UNIX just as you would for a PC

Bowtie

• Bowtie is a more heavily indexed search program, which requires a more exact match than BLAST. It is much, much faster.

• See if you can figure this out.

• Use bowtie-build to build the database, bowtie to search it…

Your path

• To see your path, type echo $PATH

• If you are bored with typing the full path to programs, you can put them in your path.

• E.g.

mkdir ~/bin/

mv program ~/bin/

export PATH=$PATH:~/bin

program
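The same steps can be tried without touching your real ~/bin. In this sketch a scratch directory stands in for ~/bin and a two-line script stands in for the program. The name myprogram_demo is hypothetical.

```shell
# Use a scratch directory rather than the real ~/bin, so nothing is overwritten.
demo_bin=$(mktemp -d)

# A stand-in "program": a two-line shell script (hypothetical name).
printf '#!/bin/sh\necho "program ran"\n' > "$demo_bin/myprogram_demo"
chmod 755 "$demo_bin/myprogram_demo"

# Append the directory to PATH for this shell session only.
export PATH="$PATH:$demo_bin"

# The shell now finds the program by name alone.
found_at=$(command -v myprogram_demo)
result=$(myprogram_demo)
echo "$found_at"
echo "$result"
```

The change to PATH lasts only for the current shell session. In bash, a common way to make it permanent is to add the export line to ~/.bashrc.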

Source Code

• Most bioinformatics software is free, and open source. That is, you can download the actual instructions the programmer wrote.

• This is great, because it means you can install these programs on almost any machine.

• If somebody asks you for money for bioinformatics software, DON’T DO IT!

The GNU install pragma

• GNU source code can be complicated to compile, so it comes with programs to help you. There is a standard way to build and install GNU software.

[user@server ~]$ gunzip program.tar.gz

[user@server ~]$ tar -xvf program.tar

[user@server ~]$ cd program

[user@server ~]$ ./configure

[user@server ~]$ make

[user@server ~]$ make install

The root user

• Most UNIX machines have an account called “root”

• root can see everything, change everything, delete everything, including other users’ work

• Unless you buy your own machine, nobody sane will give you root access

• You usually need root access to install programs in the default location. But you can put them in your home directory instead.

UNIX summary

• Use ls, cd, mv, cp, nano and friends to deal with files and directories

• Install, or compile, any program you like. Most are free.

• Use blastall, hmmer etc on the command line for high throughput work. Transfer the output to a file for best results and run in the background. Grep the output file to get pertinent information….

• Or process it with a Perl script