
Unix for Bioinformaticists: Unix Tools, Emacs, and Perl

helpdesk at stat.rice.edu
Aug 2004
Some slides are borrowed from Dr. Woely’s (BCM) presentation.

Do I Have to Know/Use Unix?

Simple answer: no. Windows can do almost everything.

Complicated answer: yes, if you
are lazy (would like to automate things)
are good at reading manuals and writing scripts
want to make better use of your machine
are as poor as I am (cannot afford pricey Windows software)
especially if you will be a bioinformaticist

Why Unix Is Useful in Bioinformatics

Many tasks involve processing large text-based datasets. Unix tools are in many cases better than their Windows counterparts.
You may need to use several tools to accomplish a task. Windows is not particularly good at gluing them together.
When you need more CPU power, servers and clusters are usually *nix-based.
Many tools are available only under Unix-like systems.

Outline

Unix in general
Unix tools
Emacs
Perl

Unix Commands

Single command:

> sort -k1 file.txt

Combine other commands:

> sort -k1 file.txt | grep "Tag=Mouse" > output.txt

Operate on multiple files:

> foreach file (*.txt)

sort -k1 $file > ${file:r}_sorted.txt

end
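The slide uses csh syntax. As a rough sketch only, the same three steps could look like this in a Bourne-style shell (sh/bash), which is an assumption about the reader's shell; the sample data and the scratch directory are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample data: an ID followed by a tag.
printf 'c3 Tag=Mouse\na1 Tag=Human\nb2 Tag=Mouse\n' > file.txt

# Combine commands: sort on field 1, then keep only the Mouse lines.
sort -k1 file.txt | grep "Tag=Mouse" > output.txt
cat output.txt

# Operate on multiple files: write a sorted copy of every .txt file.
# ${file%.txt} strips the extension, like $file:r in csh.
for file in *.txt; do
    sort -k1 "$file" > "${file%.txt}_sorted.txt"
done
ls
```

The glob is expanded once before the loop starts, so the newly written _sorted.txt files are not picked up again.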

More commands

> rename .html .htm *.html

There are many such convenient tools. If you cannot find one, you can write a script:

> foreach f (*.html)

mv $f $f:r.htm

end
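For readers on sh/bash rather than csh (an assumption, since the slide shows csh), a minimal sketch of the same rename loop follows. The two file names are invented, and the loop runs in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch index.html about.html   # made-up sample files

for f in *.html; do
    # ${f%.html} removes the trailing .html, playing the role of $f:r in csh.
    mv "$f" "${f%.html}.htm"
done

ls
```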

More commands

> convert -rotate 90 file.jpg file.png

Convert a .jpg file to .png format after rotating 90 degrees.

> wget -r -l1 --no-parent -A.tar.gz -Ppackages http://cran.r-project.org/src/contrib/PACKAGES.html

Download all .tar.gz files to the packages directory. This command can do everything that ‘teleport’ and similar tools can do under Windows.

A shell script: lyx2pdf

#!/bin/csh

set file = $1:r

lyx --export latex $file.lyx

latex $file.tex

dvips -o $file.ps $file.dvi

ps2pdf $file.ps

> lyx2pdf myfile.lyx

A Makefile

%.html: %.tex
	latex2html -local_icons -no_subdir -split 0 $*.tex

%.tex: %.lyx
	lyx2tex $*.lyx

%.dvi: %.tex
	latex $*.tex

%.ps: %.dvi
	dvips -o $*.ps $*.dvi

%.pdf: %.ps
	ps2pdf $*.ps

> make file.dvi
> make file.ps
> make file.pdf

A Perl Script

#!/usr/bin/perl

# read all the things at once

undef $/;

# read in the file and look for /* */

($comm) = <> =~ /.*\/\*(.*)\*\//ms;

# print comments

print $comm, "\n";

crontab

# do not forget to renew your library books

0 0 15 7 * mail bpeng@rice.edu %subject reminder Renew all the books!

# backup your files to server every day at 6AM

0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile

Graphviz

digraph G

{

A->B->C

B->D->C

}

File: try.dot

> dot -Tps try.dot -o try.eps

Useful (and free) tools

Servers: Apache, openssh, openldap

Web: Mozilla/firefox, Konqueror, lynx

Mail clients: Pine, Mutt, Mozilla/thunderbird, kmail, evolution

Text processing: tetex/lyx, open office, koffice

Languages: gcc, Perl, python, gmake, kdevelop

Scientific libraries and tools: GNU Scientific Library, bioPython, bioPerl, R, Graphviz, gnuplot, octave

Misc: VNC, wget,

Unix text-processing tools

Access to Unix

Mac OSX + developers kit
Linux
Stat and ruf/owlnet servers (Solaris)
Windows + cygwin

Tools - in contrast to Excel, faster, operate on larger files

Grep, Pipes, Sort, Comm, Diff, Join
Sed - regular expression substitution editor, replaced by perl in most contexts
Man - to list manual pages with options for most commands (if installed and concurrent version)

Grep

Grab lines that match a text phrase
Only the line that matches
Lines before or after the matched line
Lines that do not match
Piping multiple searches

GenBank Files

Grab the Locus, Definition and Keyword lines

phase2.txt.out

temp

Select Non-Human Definition Lines and Use Pipe

temp

kworley% grep -v Homo temp | grep DEF
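The contents of phase2.txt.out and temp are not shown here, so the following is only a sketch under assumptions: the records are invented and heavily abbreviated, and the first grep is one plausible way to build temp, not necessarily the command used on the slide.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up, abbreviated GenBank-style records (not real entries).
cat > records.txt <<'EOF'
LOCUS       EX000001
DEFINITION  Homo sapiens example clone.
ACCESSION   EX000001
KEYWORDS    HTG.
LOCUS       EX000002
DEFINITION  Mus musculus example clone.
ACCESSION   EX000002
KEYWORDS    HTG.
EOF

# Grab the Locus, Definition and Keyword lines into temp.
grep -E '^(LOCUS|DEFINITION|KEYWORDS)' records.txt > temp

# Drop every line that mentions Homo, then keep only the definition lines.
nonhuman=$(grep -v Homo temp | grep DEF)
echo "$nonhuman"
```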

Specify Lines to return

grep -1

grep -B1

grep -A1
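A small sketch of the three context options on an invented five-line file. It uses -C1 for the symmetric case; GNU grep also accepts the shorter -1 shown above, but whether other grep implementations do is not assumed here.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' alpha beta gamma delta epsilon > lines.txt   # made-up input

before=$(grep -B1 gamma lines.txt)   # matched line plus 1 line before it
after=$(grep -A1 gamma lines.txt)    # matched line plus 1 line after it
around=$(grep -C1 gamma lines.txt)   # 1 line of context on each side

printf 'B1:\n%s\nA1:\n%s\nC1:\n%s\n' "$before" "$after" "$around"
```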

Sort

In dictionary (-d), month (-M), or numerical (-n) order
Ignore case (-f)
Specify output file (-o)
Specify the separator between fields (-t)
Unique lines only (-u)
Specify field on which to sort (-k POS1[,POS2]), fields numbered starting from 1; can specify which character in the field (field.char)
Merge more than one sorted file (-m)
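To illustrate the separator and key options, here is a minimal sketch on an invented tab-separated table of gene names and scores (sh/bash syntax assumed).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')
printf 'geneB\t10\ngeneA\t9\ngeneC\t100\n' > scores.tsv   # made-up data

# -t sets the field separator; -k2,2 limits the key to field 2;
# the n suffix compares that key numerically.
by_score=$(sort -t "$tab" -k2,2n scores.tsv)
echo "$by_score"

# Without n the same key is compared as text, so 100 sorts before 9.
sort -t "$tab" -k2,2 scores.tsv
```

Writing the key as -k2,2 rather than -k2 keeps the comparison from running on to the end of the line, which matters once a file has more than two columns.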

Comm

Select or reject lines in common between two sorted files
Options suppress printing of columns
comm [-123] file1 file2
Column 1 is lines only in file 1
Column 2 is lines only in file 2
Column 3 is lines in both files
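A short sketch of the column-suppression options, using two invented, already sorted word lists.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' apple banana cherry > file1   # both inputs must be sorted
printf '%s\n' banana cherry date > file2

both=$(comm -12 file1 file2)          # hide columns 1 and 2: lines in both files
only_first=$(comm -23 file1 file2)    # hide columns 2 and 3: lines only in file1
only_second=$(comm -13 file1 file2)   # hide columns 1 and 3: lines only in file2

printf 'both:\n%s\nonly file1: %s\nonly file2: %s\n' "$both" "$only_first" "$only_second"
```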

Diff

Compares two files (or sets of files in a directory) and outputs lines with differences
Compare as text (-a)
Ignore changes in white space (-b) or blank lines (-B), case difference (-i)
For directory comparisons
  Report only files that differ, not details (-q)
  Compare subdirectories recursively (-r)

Join

Combines lines from two files based on a common field (-1 field -2 field)
Specify the fields from each file and the order to output (-o file_number.field file_number.field file_number.field)
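As an illustration, the sketch below joins two invented files on their first field and then uses -o to pick the output columns. The file names and values are hypothetical; both files are already sorted on the join field, which join expects.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'g1 BRCA1\ng2 TP53\ng3 MYC\n' > genes.txt   # id, gene name (made up)
printf 'g1 5.2\ng3 7.1\n' > expr.txt               # id, value (made up)

# Join on field 1 of each file; ids missing from either file are dropped.
joined=$(join -1 1 -2 1 genes.txt expr.txt)
echo "$joined"

# -o picks fields as file_number.field: gene name from file 1, value from file 2.
picked=$(join -1 1 -2 1 -o 1.2,2.2 genes.txt expr.txt)
echo "$picked"
```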

What is Emacs?

A Unix text editor with additional functionality
Column functions
Settings for DNA mode
Settings for programming mode
Seamless integration with matlab, R, S-Plus, SAS etc.

Emacs Demonstrations

Search and replace
  By query
  All
  New lines
  Counting things

Column functions
  Select
  Kill
  Copy
  Paste

Query replace

Esc %
Replace phrase
With phrase
Designate carriage return with control Q control J
Y or N
! To replace all

Starting File

Query Replace

End file

Rectangle functions

Mark, select rectangle
Control x r r a: to register the rectangle as buffer a
Control x r k: to kill the rectangle
Control x r i a: to insert previously registered rectangle a from buffer

Select Rectangle, Kill

Select Rectangle, Mark, Insert

What is Perl?

A general purpose programming language.
Invented to replace awk, sed, and sh.
A scripting language.
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister

“There is more than one way to do it” TIMTOWTDI

How to Use Perl

Perl “scripts” (programs) are text and are interpreted by the perl program.

TIMTOWTDI:

You can put the script on the command line:
> perl -e 'print "Hello, world!\n";'

You can pass it as an argument to perl:
> perl my_program.pl

You can make the script self-executing:
> my_program.pl

print, ", ', \n

'print "Hello, world!\n";'

In most programming languages, "print" means "display" or "output".
The single and double quote characters ( " ' ) are used to set apart blocks of "text". In this example, the single quotes set apart the perl script, and the double quotes set apart the text to display. (Perl has other ways to quote.)
The backslash, '\', is used to change the meaning of a character, e.g. to generate special characters. \n means "start a new line" (e.g. the Carriage Return, or Return, or Enter).

Example of a One Liner
(Thanks to Dr. Wheeler)

perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out.txt

perl -nle

'@f=split/\t/; print if ($f[2] > 95);'

blast_tbl_in.txt >blast_tbl_out.txt

A One Liner: TIMTOWTDI

1. perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out1.txt

2. perl -ne '@f=split; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out2.txt

3. perl -ane 'print if ($F[2] > 95 );' blast_tbl_in.txt > blast_tbl_out3.txt
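In the same TIMTOWTDI spirit, the filter can also be sketched with awk, one of the tools Perl was meant to replace. This swaps awk in for Perl and is not one of the slide's own variants; the input rows below are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up tab-separated rows: query, subject, percent identity.
printf 'q1\ts1\t98.5\nq2\ts2\t90.0\nq3\ts3\t95.0\nq4\ts4\t99\n' > blast_tbl_in.txt

# awk numbers fields from 1, so $3 here corresponds to Perl's $f[2].
# A pattern with no action prints every line for which it is true.
awk -F '\t' '$3 > 95' blast_tbl_in.txt > blast_tbl_out.txt
cat blast_tbl_out.txt
```

The q3 row is dropped because 95.0 is not strictly greater than 95.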

split, if, variables

@f=split/\t/; print if ($f[2] > 95);

split is a function. It can be written with parens like in most languages, and takes UP TO three arguments:
split( where_to_split, what_to_split, how_many_to_split)
split, like many Perl statements, uses defaults for missing arguments.
Special characters mark @whole_arrays, $array_members[1], %whole_hashes, $hash_members{'one'}, $simple_variables.
if acts like its common English meaning. It can go before a block or at the end of a statement (as above).
Perl converts between numbers and text. '>' is a numeric operator so 95 and $f[2] are treated as numbers. If gt replaced >, they would be treated as strings.

FASTA to XML

perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa

[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% more test.fa
</seq><title>CSTAP1E0101A</title><seq>
gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGTACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTTGTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTTTCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCCAAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCATCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTAAAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTGcacaacaatttttggtatgctaacttatacttgtgcctaatccttaaggaaaagaaagagccatatacctaaaactgactttatttttcaaaaggta
</seq><title>CSTAP1E0102A</title><seq>
tttttgctggcgaactatcaggagactacagxaactacttttcagtxcgaactcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaaccctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcctgatgcggtattttctccttacgctttcaatgatgagcacttxtaaaggtctgx
</seq><title>CSTAP1E0103A</title><seq>
atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCACCCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTCTCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAAGTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGGATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACCACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAGactgatgaactacttgcagtcgaactccaatcattactggccgtcgttttaa

Executing a Perl Script in a File

$line = <>;
$line =~ s">(.*)"<title>\1</title><seq>";
print $line;

while( $line = <> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print $line;
}

print "</seq>\n";

File Reading, Binding, while

$line = <>;
<> reads one line from the "current file"

$line =~ s">(.*)"<title>\1</title><seq>";
=~ makes the preceding string the "current line" (Binding)

while( $line = <> ) {
    print $line;
}
Repeats the statements between { and } while there is another line.

Self-executing Perl Scripts

You need to know the path to your Perl program:
> which perl
/usr/bin/perl

The first line of your script must be:
#!/usr/bin/perl

Permissions need to allow execution:
> chmod 755 my_program.pl

FASTA to XML Fleshed Out

#!/usr/bin/perl
#
# fasta2xml by David Steffen 6/2/2004
# - Converts fasta file to mini-xml format

$inpfile = shift( @ARGV );

if( not( $inpfile =~ m/^(.*)\.fa$/ ) ) {
    die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
}
$basefile = $1;

open( INPFILE, $inpfile ) or die( "Can't open $inpfile: $!\n" );

$outfile = '>' . $basefile . '.xml';
open( OUTFILE, $outfile ) or die( "Can't open $outfile: $!\n" );

$line = <INPFILE>;
$line =~ s">(.*)"<title>\1</title><seq>";
print OUTFILE $line;

while( $line = <INPFILE> ) {
    $line =~ s">(.*)"</seq><title>\1</title><seq>";
    print OUTFILE $line;
}

print OUTFILE "</seq>\n";

Running Other Programs from Perl

$files = `ls`;

The "backtick" (` `) characters execute the text in between as a command to the operating system, returning the output of that command (e.g. to the $files variable).

$error = system( "mv $file ${basefile}.abi" );

The system statement executes its argument as a command to the operating system, returning the command's exit status (0 on success), not its error messages. (Output is printed as usual.) There are other, subtle differences between ` ` and system.