Introduction to UNIX and Perl Todd Scheetz Sept. 6, 2001 Computational Methods in Molecular Biology.

Introduction to UNIX and Perl

Todd Scheetz

Sept. 6, 2001

Computational Methods in Molecular Biology

Definitions

Operating System
• provides a uniform interface between a computer’s hardware and user-level programs.
• manages the low-level functionality of the hardware automatically.

Programming Language
• provides a formal structure/syntax for implementing algorithmic procedures.

What is UNIX?

Operating system developed at Bell Labs.
• originally written in assembly code
• the C programming language was designed to implement a more portable version of UNIX

Multi-user
Multi-tasking

What is UNIX? (part 2)

Made available with source code at no cost
• could fix bugs, add features or just test alternative methods
• EXCELLENT for learning or teaching

Adopted by Berkeley to make BSD
• virtual memory
• paging
• networking (TCP/IP)

What is UNIX? (part 3)

By programmers, for programmers
• extensive facilities to allow people to work together and share information in controlled ways
• time sharing system

Basic Guidelines
• Principle of least surprise
• every program should do one thing and do it well

UNIX hierarchy

Adapted from Tanenbaum, p. 273

Layers, from top to bottom, with the interface (i/f) between adjacent layers:

Users
    -- User i/f --
Std. Utility Programs (shell, editor, compiler)
    -- Library i/f --
Standard Libr. (open, close, fork, read, print, etc.)
    -- System call i/f --
UNIX O/S (process mgmt, memory mgmt, file system, I/O, etc.)
Hardware (CPU, memory, disks, keyboard, etc.)

UNIX Basics

User Accounts - required to log-on to the computer with username and password.

Groups - entity made up of one or more users.

Sharing...

[Figure: users Bob, Stacie, Diane, Mike, and Bill, divided between group1 and group2.]

UNIX Basics

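File Sharing - Regulated by three sets of permissions.

Permissions: read (r), write (w), execute (x)

Subjects: owner (user, u), group (g), everyone else (other, o). In chmod, a (all) refers to all three at once.

Each listing shows the owner, group, and other permissions, in that order:

-rwxr-xr-x foo.pl
-r-xr-xr-x bar.pl
-rw------- secret
-rw-r--r-- public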
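The permission strings above can be decoded mechanically from the numeric mode that chmod accepts. The following is a minimal sketch, not part of the original slides; the subroutine name mode_to_string is an illustrative assumption.

```perl
use strict;
use warnings;

# Convert a numeric mode such as 0755 into the nine-character string ls shows.
# (mode_to_string is a hypothetical helper name, used only for illustration.)
sub mode_to_string {
    my ($mode) = @_;
    my $string = '';
    # Walk the three subjects from owner (high bits) to other (low bits).
    foreach my $shift (6, 3, 0) {
        my $bits = ($mode >> $shift) & 7;
        $string .= ($bits & 4) ? 'r' : '-';
        $string .= ($bits & 2) ? 'w' : '-';
        $string .= ($bits & 1) ? 'x' : '-';
    }
    return $string;
}

# The four modes behind the listings for foo.pl, bar.pl, secret and public.
foreach my $mode (0755, 0555, 0600, 0644) {
    printf "%04o  %s\n", $mode, mode_to_string($mode);
}
```

Each octal digit covers one subject, which is why chmod 644 gives the owner read and write access and everyone else read access only.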

UNIX Basics

Super-user account
• complete access to all files

Required for system administration tasks
• add accounts/groups
• change permissions/owners of any file
• change password of any account
• shut down a machine

UNIX Basics
UNIX Filesystem Hierarchy

/
    bin  etc  usr  var  tmp  dev  lib

/usr
    bin  doc  lib  local

/usr/local
    bin  etc  lib  tmp

Paths shown: /usr, /usr/bin, /usr/local, /usr/local/bin

Two shortcuts
.  - the current directory
.. - the directory one level “up”

What is UNIX?

Processes

Each program executes as a process

A process provides encapsulation for the program

Under UNIX, multiple processes can be running at the same time!

How to control processes:
^C -- break
^Z -- stop
& -- start in background
ps -- show which processes are running
kill -- kill a process

What is UNIX?

grep - show every line from a file that matches a supplied pattern
    Ex. grep sub my_program.pl
    (would return every line in the file that contained the string ‘sub’)

ls - list files
    Ex. ls *.pl
    (would list all files in the current directory that end in ‘.pl’)

head - list the first lines in a file
    Ex. head -20 my_program.pl
    (would show the first 20 lines from my_program.pl)

sort - performs a lexical sorting of a file
    Ex. sort my_program.pl

What is UNIX?

UNIX also provides a method for chaining multiple programs together, so that the output of one program becomes the input of the next.

Pipes…

    Ex. head -20 *.pl | grep File | sort

UNIX Basics
UNIX Command Summary

pwd - print working directory
cd - change directory
ls - list files
mv - move a file (relocate/rename)
rm - remove a file
cp - copy a file

mkdir - make a new directory
rmdir - remove a directory
more - display the contents of a file (one screen at a time)

chmod - change the permissions on a file
chgrp - change the group associated with a file

UNIX Shell

Shells

a.k.a. command interpreter
the primary user interface to UNIX
interpret and execute commands

1. Interactive use
2. Customization of UNIX session (environment)
3. Programmability

/bin/sh - Bourne shell
/bin/csh - C shell
/bin/bash - Bourne again shell
/bin/tcsh - modified, updated C shell

UNIX Shell

bash

prompt -- by default shows who you are, what machine the shell is running on, and what directory you are in.

PATH -- environment variable that defines where the shell should look for the programs you are running.

/bin
/usr/bin
/usr/local/bin
/usr/X11R6/bin
/usr/sbin
.
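The way the shell uses PATH can be imitated in a few lines of Perl. This is a rough sketch under the assumption that PATH is a colon-separated list of directories searched in order; find_in_path is a hypothetical helper name, not a shell or Perl built-in.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of the first executable file named $program found in
# the given directories, or undef if none of them contains it.
# (find_in_path is a hypothetical name used only for this illustration.)
sub find_in_path {
    my ($program, @dirs) = @_;
    foreach my $dir (@dirs) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x $candidate;
    }
    return undef;
}

# PATH is a colon-separated list of directories; skip empty entries.
my @path_dirs = grep { length } split /:/, ($ENV{PATH} // '');
my $found = find_in_path('perl', @path_dirs);
print defined $found ? "perl found at $found\n" : "perl not found in PATH\n";
```

Because the directories are searched in order, a program in an earlier directory hides a program of the same name in a later one.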

Installing Software

Pre-built vs. source

RPM vs. “raw” binaries

Process
• downloading
• extracting
• compiling
• installation
• configuration

Mini-Tour of UNIX

Go through the most common commands.

Perl

Basics of a Perl program under UNIX

Perl is an interpreted language

The first line of a Perl program (in UNIX) is...
    #!/usr/bin/perl

The # character is the comment character.

All single-expression statements must end in a semi-colon.
    $area = $pi * $radius * $radius;
    while (CONDITION) {
        # some stuff
    }

Programming Languages

Input/Output in Perl

Reading in from the keyboard...
    $line = <STDIN>;

Filehandles...

File:
    open(FH,"filename");     # open for reading
    open(FH,">filename");    # open for writing
    ...
    $line = <FH>;
    ...
    close(FH);

DO HELLO WORLD WALK-THROUGH.
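A minimal sketch of what such a walk-through might cover, combining the hello-world program with the file operations above. It is not the walk-through given in the lecture; the subroutine names write_lines and read_lines are illustrative, and the three-argument form of open with a lexical filehandle is a more defensive variant of the open(FH,...) calls shown on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

print "Hello, world!\n";

# Write each string to $filename, one per line.
sub write_lines {
    my ($filename, @lines) = @_;
    open(my $out, '>', $filename) or die "Cannot write $filename: $!";
    print $out "$_\n" foreach @lines;
    close($out);
}

# Read $filename back and return its lines without trailing newlines.
sub read_lines {
    my ($filename) = @_;
    open(my $in, '<', $filename) or die "Cannot read $filename: $!";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

# Use a temporary file so the example leaves nothing behind.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close($tmp_fh);
write_lines($tmp_name, "first line", "second line");
print "Read back: $_\n" foreach read_lines($tmp_name);
```

Checking the return value of open with "or die" reports a missing or unreadable file immediately instead of silently reading nothing.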

Programming Languages

Data Types

Integer - 0, 1, 2, …, 1000, 1001, …
Floating Point - 0.0, 0.001, 0.0003, 3.14159265, …
Character - a, b, c, d, …, 0, 1, 2, :, !, …

Different languages use different conventions. In Perl, a string is also a basic data type. A string is a sequence of 0 or more characters.

Programming Languages

Variables - Pieces of data stored within a program. (similar to variables in arithmetic)

Scalar variables are distinguished by the ‘$’ at their front.

Any name beginning with a letter is allowed
    $a
    $a1
    $alphabet_soup_is_OK_to_me

Programming Languages
Arithmetic Operations

+     Addition
-     Subtraction
*     Multiplication
/     Division
%     Modulo
++    Increment
--    Decrement
||    Logical OR
&&    Logical AND
!     Logical Negation

Programming Languages
Arithmetic Operations

Numeric   String   Meaning
==        eq       Equality
!=        ne       Inequality
>         gt       Greater than
>=        ge       … or equal to
<         lt       Less than
<=        le       … or equal to

Programming Languages
Statements

A program can be broken down into basic structures called statements. Statements are terminated by a semi-colon.

    print "Hello, world!\n";

Assignment statements use a single ‘=’ rather than the ‘==’ of the equality operation.

    $pi = 3.1415926;
    $area = $pi * $radius * $radius;
    $line = <STDIN>;
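The difference between assignment and comparison, and between numeric and string comparison, can be seen in a short sketch. The helper names same_number and same_string are illustrative assumptions, not part of the slides.

```perl
use strict;
use warnings;

my $pi     = 3.1415926;                  # assignment: store a value in $pi
my $radius = 2;
my $area   = $pi * $radius * $radius;    # assignment of a computed value

# == compares numbers; eq compares strings character by character.
sub same_number { my ($x, $y) = @_; return $x == $y ? 1 : 0; }
sub same_string { my ($x, $y) = @_; return $x eq $y ? 1 : 0; }

print "area = $area\n";
print "1.0 == 1 : ", same_number("1.0", "1"), "\n";   # prints 1
print "1.0 eq 1 : ", same_string("1.0", "1"), "\n";   # prints 0
```

Writing = where == was intended is a common slip: the condition then performs an assignment and tests the assigned value instead of comparing.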

Programming Languages

Variable Types

Scalar - a single value
Array - a list of values (indexed by sequential number)
Hash - a set of key,value pairs

Prime Numbers = (1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)

Array            Hash
index  value     key     value
0      1         First   1
1      2         Second  2
2      3         Third   3
3      5         Fourth  5
...    ...       ...     ...

Programming Languages

Arrays are good when the data is dense, and the algorithm uses a linear access pattern.

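Prime Numbers = (1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)

Indexed by the number itself, with a 1 marking each number in the list:

index:  0  1  2  3  4  5  6  7  8  9 10 11
flag:      1  1  1  0  1  0  1  0  0  0  1

Indexed by position in the list:

index:  0  1  2  3  4  5  6  7  8  9 10 11
value:  1  2  3  5  7 11 13 17 19 23 29 31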

Programming Languages

Array:

index:  0  1  2  3  4  5  6  7  8  9 10 11
value:  1  2  3  5  7 11 13 17 19 23 29 31

Hash:

key:    1  2  3  5  7 11 13 17 19 23 29 31
value:  1  1  1  1  1  1  1  1  1  1  1  1

Hash - “associative array”
• array indices can be any unique set of “keys”
• excellent for accessing in random patterns (in sparse data)

(Ex. “is 19 a prime number?”)
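The question above shows why a hash suits this kind of lookup. The sketch below contrasts scanning an array with a single keyed access; in_array and in_hash are illustrative names, and the list here starts at 2 because 1 is not conventionally counted as a prime.

```perl
use strict;
use warnings;

my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

# Hash with one key per prime; the value 1 just marks membership.
my %is_prime = map { $_ => 1 } @primes;

# Array lookup: scan the list until the number is found.
sub in_array {
    my ($n, @list) = @_;
    foreach my $value (@list) {
        return 1 if $value == $n;
    }
    return 0;
}

# Hash lookup: a single keyed access, no scan.
sub in_hash {
    my ($n, $set) = @_;
    return exists $set->{$n} ? 1 : 0;
}

print "19 is prime\n"     if in_hash(19, \%is_prime);
print "20 is not prime\n" unless in_array(20, @primes);
```

Both answers agree, but the array version does work proportional to the length of the list, while the hash version does not depend on how many keys are stored.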

Programming Languages

Scalar -- $foo, $a1, $a2000

Array -- @array, @ii

to access the element at index $i
    $array[$i]

the last index of an array is $#array

the number of elements in an array is
    $num_elements = $#array + 1;
OR
    $num_elements = @array;

Programming Languages

Hash -- %hash, %env

to access the element with index of $i
    $hash{$i}

to get a list of keys used in a hash
    @key_list = keys(%hash);

to determine how many keys are in a hash
    $num_elements = @key_list;
OR
    $num_elements = keys(%hash);
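The array and hash size idioms above can be tried together in a short sketch. The data (nucleotides and their complements) is only an illustrative choice.

```perl
use strict;
use warnings;

my @array = ('A', 'C', 'G', 'T');
my %hash  = (A => 'T', C => 'G', G => 'C', T => 'A');

my $last_index   = $#array;            # index of the final element
my $num_elements = @array;             # an array in scalar context gives its length
my @key_list     = sort keys(%hash);   # keys come back in no fixed order, so sort them
my $num_keys     = keys(%hash);        # keys in scalar context gives the count

print "last index = $last_index, elements = $num_elements\n";
print "keys = @key_list, count = $num_keys\n";
print "complement of G is $hash{'G'}\n";
```

The same expression (@array or keys(%hash)) yields a list or a count depending on whether it is assigned to an array or to a scalar.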

Programming Languages

Control of Program Execution

if -- executes a block of code, if the condition evaluates to TRUE

    if($light eq "green") {
        continue_driving();
    }

    if( ($light eq "green") && ($no_traffic) ) {
        continue_driving();
    }

Programming Languages

In many cases, a simple if statement is not sufficient, as multiple alternative outcomes need to be evaluated.

    if($light eq "green") {
        continue_driving();
    } else {
        stop_car();
    }

    if($light eq "green") {
        continue_driving();
    } elsif($light eq "red") {
        stop_car();
    } else {
        go_fast_to_beat_the_yellow();
    }

Programming Languages

Control of Program Execution

Sometimes you need to iterate through a statement multiple times...

Looping constructs:
    for (…) { … }
    foreach $var (@list) { … }
    while (COND) { … }

Programming Languages

Foreach Loop…

    foreach $var (@list) {
        do_stuff($var);
    }

    foreach $name (@name_list) {
        print "Name = $name\n";
    }

    foreach $name (@name_list) {
        if($hair_color{$name} eq "blond") {
            print "$name has blond hair.\n";
        }
    }

Programming Languages

    for (INIT; COND; POST) {
        do_stuff();
    }

    for ($i=0; $i < 50; $i++) {
        print "i = $i\n";
    }

    for ($i=0; $i < 50; $i++) {
        if($prime{$i} == 1) {
            print "$i is prime!\n";
        } else {
            print "$i is not prime.\n";
        }
    }

Programming Languages

    while (COND) {
        do_stuff();
    }

    while($line = <FILE_HANDLE>) {
        print "$line";
    }

    while($flag == 0) {
        if($prime{$position} == 1) {
            $flag = 1;
        } else {
            $position++;
        }
    }

Intermission

Review of Perl Concepts

Data Types
    scalar
    array
    hash

Input/Output
    open(FILEHANDLE,"filename");
    $line = <FILEHANDLE>;
    print "$line";

Arithmetic Operations
    +, -, *, /, %
    &&, ||, !

Review of Perl Concepts

Control Structures
    if
    if/else
    if/elsif/else
    foreach
    for
    while

Regular Expressions

General approach to the problem of pattern matching

RE’s are a compact method for representing a set of possible strings without explicitly specifying each alternative.

For this portion of the discussion, I will be using {} to represent the scope of a set.

{A}
{A,AA}

Ø = the empty string, so {Ø} is the set containing only the empty string

Regular Expressions

In addition, the [] will be used to denote possible alternatives.

[AB] = {A,B}

With just these semantics available, we can begin building simple Regular Expressions.

[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB, AABBB}

Regular Expressions

Additional Regular Expression components
* = 0 or more of the specified symbol
+ = 1 or more of the specified symbol

A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }

AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }

Regular Expressions

What if we want a specific number of iterations?

A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}

What if we want any character except one? (Here the alphabet is just A and B.)
[^A] = {B}

What if we want to allow any symbol?

. = {A, B}

.* = {Ø, A, B, AA, AB, BA, BB, … }
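These set definitions can be checked in Perl by testing whether a string belongs to the set a pattern describes. This is a small sketch rather than anything from the slides: in_set is a hypothetical helper, and it wraps the pattern in ^ and $ so that the whole string must match.

```perl
use strict;
use warnings;

# Return 1 if the whole of $string is in the set described by $pattern.
# Anchoring with ^ and $ makes the pattern describe the entire string,
# which is how the sets on the slides are defined.
sub in_set {
    my ($pattern, $string) = @_;
    return $string =~ /^(?:$pattern)$/ ? 1 : 0;
}

foreach my $string ('', 'A', 'AB', 'ABBB', 'BA', 'AAAAA') {
    printf "%-8s AB*:%d  [AB]{1,2}:%d  A{2,4}:%d\n",
        "'$string'",
        in_set('AB*', $string),
        in_set('[AB]{1,2}', $string),
        in_set('A{2,4}', $string);
}
```

Without the anchors, the test would succeed whenever any part of the string matched, so 'AAAAA' would wrongly appear to be in A{2,4}.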

Regular Expressions

All of these operations are available in Perl

Several “shortcuts”

\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }

Name             Definition               Code
Whitespace       [space, tab, new-line]   \s
Word character   [a-zA-Z_0-9]             \w
Digit            [0-9]                    \d

Pattern Matching

Perl supports built-in operations for pattern matching, substitution, and character replacement

Pattern Matching

    if($line =~ m/Rn\.\d+/) {    # \. matches a literal dot
        ...
    }

In Perl, an RE can match part of the string rather than the whole string. To tie the match to the ends of the string, use:

^ - beginning of string
$ - end of string

Pattern Matching

Back references…

    if($line =~ m/(Rn\.\d+)/) {    # parentheses capture the matched text in $1
        $UniGene_label = $1;
    }
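As a rough sketch of how capturing and anchoring work together, the two subroutines below extract a label from anywhere in a line and test whether a string consists of a label and nothing else. The names unigene_label and is_unigene_label and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Return the first Rn.<digits> label found anywhere in $line, or undef.
sub unigene_label {
    my ($line) = @_;
    if ($line =~ m/(Rn\.\d+)/) {
        return $1;               # text matched inside the parentheses
    }
    return undef;
}

# Anchored variant: the label must be the entire string.
sub is_unigene_label {
    my ($text) = @_;
    return $text =~ m/^Rn\.\d+$/ ? 1 : 0;
}

my $line  = "ID          Rn.1234";    # hypothetical input line
my $label = unigene_label($line);
print "label = $label\n" if defined $label;
print "whole line is a label? ", is_unigene_label($line), "\n";
print "label alone is a label? ", is_unigene_label($label), "\n";
```

$1 is only meaningful after a successful match, which is why it is read inside the if block.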

Regular Expressions

    $file = "my_fasta_file";
    open(IN, $file) or die "Cannot open $file: $!";
    $line_count = 0;
    while($line = <IN>) {
        if($line =~ m/^\>/) {    # FASTA header lines start with '>'
            $line_count++;
        }
    }
    close(IN);
    print "There are $line_count FASTA sequences in $file.\n";

Pattern Matching

UniGene data file

ID          Bt.1
TITLE       Cow casein kinase II alpha …
EXPRESS     ;placenta
PROTSIM     ORG=Caenorhabditis elegans; …
PROTSIM     ORG=Mus musculus; PROTGI=…
SCOUNT      2
SEQUENCE    ACC=M93665; NID=g162776; …
SEQUENCE    ACC=BF043619; NID=…
//
ID          Bt.2
TITLE       Bos taurus cyclin-dependent …
...

Pattern Matching

Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
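One possible version of such a program is sketched below. It assumes, based only on the excerpt shown earlier, that every cluster record contains exactly one line beginning with "ID"; the default file name Bt.data and the subroutine name count_clusters are assumptions made for illustration.

```perl
use strict;
use warnings;

# Count cluster records by counting lines that begin with "ID".
# (Assumes one ID line per cluster, as in the excerpt of the data file.)
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+\S+/;
    }
    return $count;
}

# Hypothetical default file name; pass the real data file as an argument.
my $file = @ARGV ? $ARGV[0] : 'Bt.data';
if (open(my $in, '<', $file)) {
    print "There are ", count_clusters($in), " clusters in $file.\n";
    close($in);
} else {
    warn "Cannot open $file: $!\n";
}
```

Counting the // record terminators instead would work equally well under the same assumption about the file layout.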

Pattern Matching

Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering.

Important:

http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
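A hedged sketch of how such a program might be structured follows. It assumes that the GID to substitute for GID_HERE is the number following "NID=g" on each SEQUENCE line, which is a guess based on the excerpt of the data file; the subroutine name unigene_to_html and the HTML layout are likewise illustrative.

```perl
use strict;
use warnings;

# URL template from the slide; GID_HERE is replaced for each sequence.
my $URL = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve'
        . '&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Build an HTML list with one link per SEQUENCE line that carries a GID.
# Assumption: the GID is the number after "NID=g".
sub unigene_to_html {
    my ($fh) = @_;
    my $cluster = '';
    my $html = "<html><body>\n<ul>\n";
    while (my $line = <$fh>) {
        if ($line =~ m/^ID\s+(\S+)/) {
            $cluster = $1;                       # remember the current cluster
        } elsif ($line =~ m/^SEQUENCE\s+ACC=([^;\s]+);\s*NID=g(\d+)/) {
            my ($acc, $gid) = ($1, $2);
            (my $link = $URL) =~ s/GID_HERE/$gid/;
            $link =~ s/&/&amp;/g;                # escape & for use inside HTML
            $html .= "<li>$cluster: <a href=\"$link\">$acc</a></li>\n";
        }
    }
    return $html . "</ul>\n</body></html>\n";
}
```

The subroutine takes an open filehandle and returns the page as a string, so the caller decides where the HTML is written, for example with print OUT unigene_to_html(\*IN).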

Substitution

Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required.

Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT.

$line =~ s/PATTERN/REPLACEMENT/;

Returns a value equal to the number of times the pattern was found and replaced.

$result = $line =~ s/PATTERN/REPLACEMENT/;
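The return value can be observed directly. In this sketch (substitute_n is an illustrative name), the g option described on the next slide is used to show a count greater than one; without it at most one replacement is made.

```perl
use strict;
use warnings;

# Apply s/N/-/ to a copy of $text, with or without the g option, and return
# both the result of the substitution and the modified string.
sub substitute_n {
    my ($text, $global) = @_;
    my $result = $global ? ($text =~ s/N/-/g) : ($text =~ s/N/-/);
    return ($result || 0, $text);    # no match returns false; show it as 0
}

my ($once, $after_once) = substitute_n("ACGTNNACGTN", 0);
my ($all,  $after_all)  = substitute_n("ACGTNNACGTN", 1);
my ($none, $after_none) = substitute_n("ACGT", 1);

print $once, " replacement:  ", $after_once, "\n";   # 1, ACGT-NACGTN
print $all,  " replacements: ", $after_all,  "\n";   # 3, ACGT--ACGT-
print $none, " replacements: ", $after_none, "\n";   # 0, ACGT
```

Because the result is false when nothing matched, the substitution can be used directly as the condition of an if statement.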

Substitution

Substitution can take several different options, specified after the final slash.

The most useful are
g - global (can substitute at more than one location)
i - case insensitive matching

    $string = "One fish, Two fish, Red fish, Blue fish.";
    $string =~ s/fish/dog/g;
    print "$string\n";

    One dog, Two dog, Red dog, Blue dog.

Substitution

Example: Removing leading and trailing white-space

$line =~ s/^\s*(.*?)\s*$/$1/;

A *? performs a minimal match: it stops at the first point at which the remainder of the expression can be matched.

    $line =~ s/^\s*(.*)\s*$/$1/;

This statement will not remove trailing white-space; because .* is greedy, it consumes the trailing white-space, which is then retained in $1.
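The difference between the two statements can be seen by running both on the same string. A minimal sketch, with trim_minimal and trim_greedy as illustrative names:

```perl
use strict;
use warnings;

# Minimal match: (.*?) stops as soon as \s*$ can match the rest.
sub trim_minimal {
    my ($s) = @_;
    $s =~ s/^\s*(.*?)\s*$/$1/;
    return $s;
}

# Greedy match: (.*) swallows the trailing white-space as well.
sub trim_greedy {
    my ($s) = @_;
    $s =~ s/^\s*(.*)\s*$/$1/;
    return $s;
}

my $line = "   hello world   ";
print "[", trim_minimal($line), "]\n";   # [hello world]
print "[", trim_greedy($line),  "]\n";   # [hello world   ]
```

Both versions remove the leading white-space, since ^\s* is matched before the capture group begins.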

Character Replacement

A similar operation to substitution is character replacement.

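    $line =~ tr/a-z/A-Z/;              # convert to upper case

    $count_CG = $line =~ tr/CG/CG/;    # count the C and G characters

    $line =~ tr/ACGT/TGCA/;            # complement a DNA sequence

Repeated substitution is not equivalent to the tr above, because later steps undo earlier ones (every A becomes T, then every T becomes A again):

    $line =~ s/A/T/g;
    $line =~ s/C/G/g;
    $line =~ s/G/C/g;
    $line =~ s/T/A/g;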
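Running tr and the sequence of substitutions on the same input makes the difference concrete. This is an illustrative sketch; the subroutine names are assumptions.

```perl
use strict;
use warnings;

# Complement with tr: every character is translated exactly once.
sub complement_tr {
    my ($seq) = @_;
    $seq =~ tr/ACGT/TGCA/;
    return $seq;
}

# Four substitutions in a row: each one also rewrites the characters
# produced by the substitutions before it.
sub complement_subst {
    my ($seq) = @_;
    $seq =~ s/A/T/g;
    $seq =~ s/C/G/g;
    $seq =~ s/G/C/g;
    $seq =~ s/T/A/g;
    return $seq;
}

# tr returns the number of characters it matched, so it can also count.
sub count_cg {
    my ($seq) = @_;
    return ($seq =~ tr/CG/CG/);
}

my $seq = "AACGT";
print "tr:    ", complement_tr($seq),    "\n";   # TTGCA
print "s///g: ", complement_subst($seq), "\n";   # AACCA
print "C+G:   ", count_cg($seq),         "\n";   # 2
```

Only the tr version yields the complement; after the four substitutions the string contains no G or T at all.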

Character Replacement

    $count_CG = 0;
    $count_AT = 0;
    while($line = <IN>) {
        $count_CG += ($line =~ tr/CG/CG/);    # accumulate over all lines
        $count_AT += ($line =~ tr/AT/AT/);
    }
    $total = $count_CG + $count_AT;
    $percent_CG = 100 * ($count_CG/$total);

    print "The sequence was $percent_CG% CG-rich.\n";

Subroutines

One of the most important aspects of programming is dealing with complexity. A program that is written in one large section is generally more difficult to debug. Thus a major strategy in program development is modularization.

Break the program up into smaller portions that can each be developed and tested independently.

Makes the program more readable, and easier to maintain and modify.

Subroutines

EXAMPLE: Reading in sequences from the UniGene.all.seq file

Multiple FASTA sequences in a single file, each annotated with the UniGene cluster they belong to.

GOAL: Make an output file consisting only of the longest sequence from each cluster.

Subroutines

ISSUES:
1. Want to design and implement a usable program.
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.

(human UniGene seqs > 2 GB)
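One way these goals might be met is sketched below; it is not the program developed in the lecture. It rests on two assumptions that should be checked against the real file: that each FASTA header names its cluster in a tag of the form /ug=CLUSTER, and that all sequences of a cluster are adjacent in the file. Under those assumptions only the current record and the best record of the current cluster are held in memory. The subroutine names are illustrative.

```perl
use strict;
use warnings;

# Read one FASTA record from $fh and return (header, sequence),
# or an empty list at end of file.
sub read_record {
    my ($fh) = @_;
    local $/ = "\n>";              # a record ends where the next header begins
    my $record = <$fh>;
    return unless defined $record;
    chomp $record;                 # drop the trailing "\n>" separator, if any
    $record =~ s/^>//;             # the first record keeps its leading '>'
    my ($header, @seq_lines) = split /\n/, $record;
    return ($header, join('', @seq_lines));
}

# Copy the longest sequence of each cluster from $in to $out.
# Assumes a /ug=CLUSTER tag in each header and clusters stored contiguously.
sub longest_per_cluster {
    my ($in, $out) = @_;
    my ($current, $best_header, $best_seq) = ('', '', '');
    while (my ($header, $seq) = read_record($in)) {
        my ($cluster) = $header =~ m{/ug=(\S+)};
        next unless defined $cluster;
        if ($cluster ne $current) {
            # A new cluster starts: write out the best of the previous one.
            print $out ">$best_header\n$best_seq\n" if $current ne '';
            ($current, $best_header, $best_seq) = ($cluster, $header, $seq);
        } elsif (length($seq) > length($best_seq)) {
            ($best_header, $best_seq) = ($header, $seq);
        }
    }
    print $out ">$best_header\n$best_seq\n" if $current ne '';
}
```

A call such as longest_per_cluster(\*STDIN, \*STDOUT) lets the program be used in a pipe. Each sequence is written on a single line, and if clusters are not contiguous in the file, a two-pass design that first records the length and file position of the best sequence per cluster would be needed instead.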