BINF 634 Fall 2015 Lect 5 1
BINF634 Lecture 5
Outline
  Program 1 Solution
  Quiz 2 Solution
  Program 2 Discussions
  Regular Expressions
  Regular Expressions Lab
  Time to Work on Program 2
Program 1 Discussions
  You must test all of your code on binf.
  I am testing your code on binf; I can't possibly know the configuration of whatever machine your code was developed on.
  The path to the perl on binf must be the first line in your program: #!/usr/bin/perl
Program 1 Solution
#!/usr/bin/perl
use strict;
use warnings;

# File: cpg.pl
# Author: Jeff Solka
# Date: 01 Aug 2015
#
# Purpose: Read sequences from a FASTA format file
# Programming Assignment #1

# the argument list should contain the file name
die "usage: cpg.pl filename\n" if scalar @ARGV < 1;

# get the filename from the argument list
my ($filename) = @ARGV;

# Open the file given as the first argument on the command line
open(INFILE, '<', $filename) or die "Can't open $filename: $!\n";

# variable declarations:
my @header = ();    # array of headers
my @sequence = ();  # array of sequences
my $count = 0;      # number of sequences

# read FASTA file
my $n = -1;         # index of current sequence
while (my $line = <INFILE>) {
    chomp $line;                 # remove trailing \n from line
    if ($line =~ /^>/) {         # line starts with a ">"
        $n++;                    # this starts a new header
        $header[$n] = $line;     # save header line
        $sequence[$n] = "";      # start a new (empty) sequence
    }
    else {
        next if not @header;     # ignore data before first header
        $sequence[$n] .= $line;  # append to end of current sequence
    }
}
$count = $n+1;      # set count to the number of sequences
close INFILE;

# remove white space from all sequences
for (my $i = 0; $i < $count; $i++) {
    $sequence[$i] =~ s/\s//g;
}

########## Sequence processing starts here:
##### REST OF PROGRAM
my $maxlength = 0;
my $minlength = 1E99;
my $sumlength = 0;
my $avlength = 0;

# process the sequences
for (my $i = 0; $i < $count; $i++) {
    $sumlength += length($sequence[$i]);
    if (length($sequence[$i]) > $maxlength) {
        $maxlength = length($sequence[$i]);
    }
    if (length($sequence[$i]) < $minlength) {
        $minlength = length($sequence[$i]);
    }
}

# guard against a file with no sequences before dividing
die "No sequences found in $filename\n" if $count == 0;
$avlength = $sumlength/$count;

# print out statistics
print "Report for file $filename \n";
print "There are $count sequences in the file \n";
print "Total sequence length = $sumlength \n";
print "Maximum sequence length = $maxlength \n";
print "Minimum sequence length = $minlength \n";
print "Ave sequence length = $avlength \n";
Program 1 Solution (cont.)
# print out sequence information
for (my $i = 0; $i < $count; $i++) {
    print "$header[$i]\n";
    print "Length:", length($sequence[$i]), "\n";

    # skip the base statistics for an empty sequence (avoids dividing by zero)
    next if length($sequence[$i]) == 0;

    # Notice that we can use scalar variables to hold numbers.
    my $a = 0; my $c = 0; my $g = 0; my $t = 0;
    my $cg = 0;

    # Use a regular expression "trick", and five while loops,
    # to find the counts of the four bases plus the CpG dinucleotides:
    # in scalar context, each /g match resumes where the previous one ended
    while ($sequence[$i] =~ /a/ig)  { $a++ }
    while ($sequence[$i] =~ /c/ig)  { $c++ }
    while ($sequence[$i] =~ /g/ig)  { $g++ }
    while ($sequence[$i] =~ /t/ig)  { $t++ }
    while ($sequence[$i] =~ /cg/ig) { $cg++ }

    printf "A:%d %0.2f \n", $a, $a/length($sequence[$i]);
    printf "C:%d %0.2f \n", $c, $c/length($sequence[$i]);
    printf "G:%d %0.2f \n", $g, $g/length($sequence[$i]);
    printf "T:%d %0.2f \n", $t, $t/length($sequence[$i]);
    printf "CpG:%d %0.2f \n", $cg, $cg/length($sequence[$i]);
}
exit;
Quiz 2 Solution
#!/usr/bin/perl -w
use strict;
use warnings;
#quiz2 Fall 2015
#Jeff Solka
my(@a)=(1..10);
print "array a prior to the function call \n";
print "@a \n";
myfun(\@a);
print "array a after the function call \n";
print "@a \n";
exit;
sub myfun{
    my($i)=@_;
    my $element;
    foreach $element(@$i) {
        $element = $element**2;
    }
}
Quiz 2 Program in Action
Program in action.
[binf:binf634/quizzs/myquizzes] jsolka% ./quiz2.pl
array a prior to the function call
1 2 3 4 5 6 7 8 9 10
array a after the function call
1 4 9 16 25 36 49 64 81 100
Quiz 2 Solution
ANY QUESTIONS ON PROGRAM 2?
Program 2 Discussions
Regular Expression Humor
A relevant cartoon
Regular Expressions
Bioinformatics programs often have to look for patterns in strings:
  Find a DNA sequence containing only C's and G's
  Look for a sequence that begins with ATG and ends with TAG
Regular expressions are a way of describing a PATTERN:
  "all the words that begin with the letter A"
  "every 10-digit phone number"
We create a regular expression to match the different parts of the pattern we're looking for:
  Ordinary characters match themselves
  Meta-characters are special symbols that match a group of characters;
  for example, \d matches any digit
Regular Expression (Why?)
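The two bioinformatics tasks listed above can be sketched in Perl as follows. This is only an illustration: the helper names only_cg and atg_to_tag and the test sequences are made up here, not taken from the lecture.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are assumptions, not part of the lecture).
# True if the sequence consists only of C's and G's (at least one).
sub only_cg    { my ($seq) = @_; return $seq =~ /^[CG]+$/   ? 1 : 0; }
# True if the sequence begins with ATG and ends with TAG.
sub atg_to_tag { my ($seq) = @_; return $seq =~ /^ATG.*TAG$/ ? 1 : 0; }

for my $seq (qw(CCGGCG ATGCCCTAG ATGCCC)) {
    printf "%-10s only C/G: %d  ATG...TAG: %d\n",
        $seq, only_cg($seq), atg_to_tag($seq);
}
```

Ordinary characters such as ATG match themselves, while [CG], +, .*, ^ and $ are the meta-characters doing the pattern work.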
Meta Characters (see Camel Book, Ch. 5)
.          match any single character
[atcg]     match any single a, t, c, or g
[A-Z]      match any character in given range
[^atcg]    match any character NOT in the set
\CHAR      takes away meta meaning of character CHAR
           [\.\|\*] matches "." or "|" or "*"
^ or \A    true at start of string
$ or \z    true at end of string
\b         true at word boundary
\B         true when not at word boundary
\d         match any digit
\D         match any non-digit
\n         match newline character
\t         match tab character
\s         match any white space character
\S         match any non-whitespace character
\w         match any "word" character (alphanumeric plus "_")
\W         match any non-word character
Regular Expression (How?)
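A small sketch of a few of these meta-characters in action. The helper name classify and the sample strings are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of a few meta-character patterns
# match somewhere in the string.
sub classify {
    my ($s) = @_;
    my @hits;
    push @hits, 'digit'       if $s =~ /\d/;         # any digit
    push @hits, 'whitespace'  if $s =~ /\s/;         # any white space
    push @hits, 'non-ACGT'    if $s =~ /[^acgt]/i;   # anything outside the set
    push @hits, 'literal dot' if $s =~ /\./;         # backslash removes meta meaning
    return join ', ', @hits;
}

for my $s ('ACGTTGCA', '>seq1 human', 'chr1.fa', 'ACG TNN') {
    print "'$s': ", classify($s) || 'none', "\n";
}
```

Note that /\./ matches only a literal period, whereas an unescaped /./ would match any character.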
Ways to Control Patterns (see Camel Book, Ch. 5)
PATTERN1|PATTERN2   matches either PATTERN1 or PATTERN2
PATTERN*            matches zero or more instances of pattern
                    [A-Z]* = any number of capital letters (including 0)
PATTERN+            matches one or more instances of pattern
                    [A-Z]+ = one or more capital letters
PATTERN{N}          matches exactly N instances of pattern
                    [ATCG]{3} = one codon
PATTERN{MIN,MAX}    matches at least MIN but not more than MAX times
                    A[C]{2,4}G matches ACCG, ACCCG, or ACCCCG
PATTERN{MIN,}       matches at least MIN times
*?                  matches 0 or more times, minimally
+?                  matches 1 or more times, minimally
{MIN,MAX}?          matches MIN to MAX times, minimally
Regular Expression (How?)
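The difference between a bounded quantifier, a greedy quantifier, and a minimal one can be seen in a short sketch. The helper matched and the test sequence are assumptions introduced for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $s (undef if none).
sub matched {
    my ($s, $re) = @_;
    return $s =~ /($re)/ ? $1 : undef;
}

my $seq = 'TTACCCGACCCCCGTT';

print 'AC{2,4}G : ', matched($seq, qr/AC{2,4}G/), "\n";  # A, 2 to 4 C's, then G
print 'A.*G     : ', matched($seq, qr/A.*G/),     "\n";  # greedy: longest stretch
print 'A.*?G    : ', matched($seq, qr/A.*?G/),    "\n";  # minimal: shortest stretch
```

Here the greedy form runs from the first A to the last G, while the minimal form stops at the first G it can reach.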
Examples
# match if string $str consists only of white space (0 or more characters)
$str =~ /^\s*$/;
# string $str consists entirely of capital letters (at least one)
$str =~ /^[A-Z]+$/;
# string $str contains a capital letter followed by 0 or more digits
$str =~ /[A-Z]\d*/;
# number $n consists of some digits before and after a decimal point
$n =~ /^\d+\.\d+$/;
# string contains A and B separated by any two characters
$s =~ /A..B/;
# string does NOT contain ATG
$s !~ /ATG/;
Regular Expression (Practice)
Examples
# match if string $str contains any sequence of three consecutive A's
$str =~ /AAA/;
$str =~ /A{3}/;
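# match if string $str consists of exactly three A's
$str =~ /^AAA$/;
$str =~ /^A{3}$/;
# match if $str contains a codon for Alanine (GCA, GCT, GCC, GCG)
$str =~ /GC./;        # note: . also matches characters that are not bases
$str =~ /GC[ACGT]/;   # stricter: the third position must be a base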
# match if $str contains a STOP codon (TAA, TAG, TGA)
$str =~ /TA[AG]|TGA/;
$str =~ /T(AA|AG|GA)/;
$str =~ /T(A[AG]|GA)/;
Regular Expression (Practice)
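The codon patterns above match anywhere in the string, regardless of reading frame. As a rough sketch of how one might restrict the STOP codon test to frame 0, the loop below checks one codon at a time; the helper name first_stop_in_frame and the example sequences are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based position of the first STOP codon in
# reading frame 0, or -1 if there is none.
sub first_stop_in_frame {
    my ($dna) = @_;
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        # same STOP codon pattern as on the slide, anchored to one codon
        return $i if substr($dna, $i, 3) =~ /^T(A[AG]|GA)$/;
    }
    return -1;
}

for my $dna (qw(ATGGCTTGACCC ATGATGGCA)) {
    my $anywhere = $dna =~ /T(A[AG]|GA)/ ? 'yes' : 'no';
    print "$dna  STOP anywhere: $anywhere  first in-frame STOP at: ",
        first_stop_in_frame($dna), "\n";
}
```

The second sequence contains TGA only across a codon boundary, so the unanchored pattern matches while the in-frame search finds nothing.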
Examples
# string contains any word containing all capital letters
$str =~ /\b[A-Z]+\b/;
# A followed by any number of C's or G's followed by T or A
$str =~ /A[CG]*(T|A)/;
$str =~ /A[CG]{0,}[TA]/;
# TT followed by one or more CA's followed by anything except G
$str =~ /TT(CA)+[^G]/;
# string begins with B and has between 5 and 10 characters
$str =~ /^B.{4,9}$/;
# string consists of a 10 digit phone number: ddd-ddd-dddd
$str =~ /^\d\d\d\-\d\d\d\-\d\d\d\d$/;
$str =~ /^\d{3}\-\d{3}\-\d{4}$/;
Regular Expression (Practice)
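The anchors in the phone number pattern matter. The sketch below compares the anchored pattern with an unanchored one and with a stricter \A...\z version; the helper names and the sample number are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are assumptions, not from the lecture).
sub is_phone       { my ($s) = @_; return $s =~ /^\d{3}-\d{3}-\d{4}$/   ? 1 : 0; }
sub is_phone_loose { my ($s) = @_; return $s =~ /\d{3}-\d{3}-\d{4}/     ? 1 : 0; }
sub is_phone_exact { my ($s) = @_; return $s =~ /\A\d{3}-\d{3}-\d{4}\z/ ? 1 : 0; }

for my $s ('555-555-0100', 'call 555-555-0100 now', "555-555-0100\n") {
    (my $shown = $s) =~ s/\n/\\n/g;   # make a trailing newline visible
    printf "%-24s anchored:%d unanchored:%d strict:%d\n",
        "'$shown'", is_phone($s), is_phone_loose($s), is_phone_exact($s);
}
```

Without ^ and $ the pattern accepts a number embedded in other text. Also, $ still matches just before a final newline, so \z is the safer choice when the input has not been chomped.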
Capturing Matches
When we match a string with a regular expression, we may want to find out what matched.
Do this by surrounding the part of interest with ( ).
Then access special variables $1, $2, etc. to get matches:
$str = "Perl is a programming language used for bioinformatics.";
$str =~ /(.*) is.*(b.*)\./;
$first = $1;
$second = $2;
print "$first $second\n"; # prints "Perl bioinformatics"
# or, you can capture the results in a list assignment:
($first, $second) = $str =~ /(.*) is.*(b.*)\./;
print "$first $second\n"; # prints "Perl bioinformatics"
Regular Expression (What Did We Match?)
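Capturing is a natural fit for pulling fields out of a FASTA header line. The following is only a sketch: the helper name parse_header and the assumed layout (">" followed by an ID, then an optional description) are illustrative choices, not part of the assignment.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into (id, description).
# Returns an empty list if the line is not a header.
sub parse_header {
    my ($line) = @_;
    # list assignment form: captures are returned directly, so a failed
    # match cannot leave stale values behind the way $1 and $2 can
    my ($id, $desc) = $line =~ /^>(\S+)\s*(.*)$/;
    return defined $id ? ($id, $desc) : ();
}

for my $line ('>seq1 Homo sapiens', '>seq2', 'ACGTACGT') {
    my ($id, $desc) = parse_header($line);
    if (defined $id) { print "id='$id' description='$desc'\n" }
    else             { print "not a header: $line\n" }
}
```

One caveat worth knowing: $1, $2, etc. keep their previous values when a match fails, so test that the match succeeded before using them.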
Capturing Matches (cont.)
A greedy quantifier such as .* captures as much as possible; the minimal form .*? captures as little as possible:
$str = "Perl is a programming language used for bioinformatics.";
$str =~ /(P.*l)/;
$word = $1;
print $word; # prints "Perl is a programming l"
$str =~ /(P.*?l)/;
$word = $1;
print $word; # prints "Perl"
$str =~ /\b(u.*?)\b/;
$word = $1;
print $word; # prints "used"
Regular Expression (What Did We Match?)
Capturing Matches
If no string is given to the match operators, $_ is assumed
@A = qw / ATGGCT CCCCGGTAT GCAGTGG /;
for (@A) {
($first, $second) = /(.+)GG(.+)/;
print "$first $second\n" if ($first and $second);
}
OUTPUT:
AT CT
CCCC TAT
Q. Why no output for third string?
Regular Expression (What Did We Match?)
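One way to explore the question above is to compare (.+), which needs at least one character on each side of GG, with (.*), which also accepts an empty side. This is a sketch; the helper names split_gg_plus and split_gg_star are made up.

```perl
use strict;
use warnings;

# Hypothetical helpers: return the two captured parts, or an empty list
# when the pattern does not match.
sub split_gg_plus { my ($s) = @_; return $s =~ /(.+)GG(.+)/; }
sub split_gg_star { my ($s) = @_; return $s =~ /(.*)GG(.*)/; }

for my $s (qw(ATGGCT CCCCGGTAT GCAGTGG)) {
    my @plus = split_gg_plus($s);
    my @star = split_gg_star($s);
    print "$s  with .+ : ", (@plus ? "'$plus[0]' '$plus[1]'" : 'no match'),
          "   with .* : ", (@star ? "'$star[0]' '$star[1]'" : 'no match'), "\n";
}
```

In the third string GG sits at the very end, so there is nothing left for the second (.+) to match. With (.*) the match succeeds but the second capture is the empty string, which a truth test such as "if ($first and $second)" would still treat as false.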
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Several rapidly developing RNA interference (RNAi)
methodologies hold the promise to selectively inhibit gene expression in
mammals. RNAi is an innate cellular process activated when a
double-stranded RNA (dsRNA) molecule of greater than 19 duplex
nucleotides enters the cell, causing the degradation of not only the
invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of
identical sequences, including endogenous mRNAs.";

# find all words containing "RNA"
while ( $string =~ /(\w*RNA\w*)/g ) {
    print "$1\n";
}
exit;

Output:
RNA
RNAi
RNAi
RNA
dsRNA
dsRNA
ssRNAs
RNAs
mRNAs
Regular Expression (What Did We Match?)
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Several rapidly developing RNA interference (RNAi)
methodologies hold the promise to selectively inhibit gene expression in
mammals. RNAi is an innate cellular process activated when a
double-stranded RNA (dsRNA) molecule of greater than 19 duplex
nucleotides enters the cell, causing the degradation of not only the
invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of
identical sequences, including endogenous mRNAs.";

# find all words with "RNA" strictly inside
# (\w+ requires at least one word character on each side)
while ( $string =~ /(\w+RNA\w+)/g ) {
    print "$1\n";
}
exit;

Output:
ssRNAs
mRNAs
Regular Expression (What Did We Match?)
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Several rapidly developing RNA interference (RNAi)
methodologies hold the promise to selectively inhibit gene expression in
mammals. RNAi is an innate cellular process activated when a
double-stranded RNA (dsRNA) molecule of greater than 19 duplex
nucleotides enters the cell, causing the degradation of not only the
invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of
identical sequences, including endogenous mRNAs.";

# find all whitespace-delimited tokens with "RNA" strictly inside
# (\S+ requires at least one non-whitespace character on each side)
while ( $string =~ /(\S+RNA\S+)/g ) {
    print "$1\n";
}
exit;

Output:
(RNAi)
(dsRNA)
(ssRNAs)
mRNAs.
Regular Expression (What Did We Match?)
Capturing Matches
When we match a string with a regular expression, several special variables get set automatically:

$string =~ /REGEXP/;
$` = part of string to the left of the match
$& = part of string matched by the regular expression REGEXP
$' = part of string to the right of the match

$string = "ATCGCAT";
$string =~ /T.G/;
print "left part: $` \n";
print "match: $& \n";
print "right part: $' \n";

Output:
left part: A
match: TCG
right part: CAT
Regular Expression (What Did We Match?)
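Perl 5.10 and later also provide ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}, which carry the same three pieces when the match uses the /p modifier. The sketch below uses them to bracket the matched text; the helper name bracket_match is an assumption, and the pattern is passed in as a plain string.

```perl
use strict;
use warnings;

# Hypothetical helper: show the match in context as left[match]right.
# Returns undef when the pattern does not match.
sub bracket_match {
    my ($s, $pattern) = @_;
    return undef unless $s =~ /$pattern/p;   # /p preserves the match pieces
    return join '', ${^PREMATCH}, '[', ${^MATCH}, ']', ${^POSTMATCH};
}

print bracket_match('ATCGCAT', 'T.G'), "\n";
print defined bracket_match('AAAA', 'T.G') ? "match\n" : "no match\n";
```

The output format is the same bracketed display that a simple regular expression tester might print with $`, $& and $'.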
A Nice Application of Capturing Matches
#!/usr/bin/perl
print ("\nEnter string or cntl-D to quit\n");
print ("Square brackets indicate text that matched pattern\n\n");
$prompt = "test> ";
print $prompt;
while(<STDIN>) {
    chomp;
    if(/REGEXP Goes Here/) {
        print("$`\[$&]$'\n");
    }
    else {
        print("no match\n");
    }
    print $prompt;
}
exit;
Regular Expression (A Regular Expression Tester)
An Even Nicer Implementation of This Idea - I
#!/usr/bin/perl
use strict;
use warnings;
# File: regex_tester.pl
# Author: Jim Logan
#
# Fully interactive version (i.e., no recompiles required) of a regular expression
# tester based on a script by Fernando J. Pineda as presented to
# class of BINF623 by Jeff Solka on 10/1/12.
# Particularly useful in an Eclipse environment using its cut and paste facility.
# instructions for use
print "\nAccepts keyboard entry of a regular expression and then permits\n";
print "successive entry of strings to test that expression.\n";
print "Square brackets in output indicate the text that matched pattern\n\n";
print "Note: Depending upon the environment (e.g. Eclipse), you may be\n";
print "able to cut and paste into both the \"Next expression\" and the\n";
print "\"New test string\" fields and then edit as desired.\n";
Regular Expression (A Nicer Regular Expression Tester)
An Even Nicer Implementation of This Idea - II
# initialization
my $regex = '/^.*$/'; #default regex to start and to demonstrate
my $string = 'This is a test string';
my $input = "";
my $stripped_regex = "";
while (1) { # outer loop to sequence regular expressions
print "\nCurrent regular expression: $regex\n";
print "Enter a new expression to change or ENTER to continue without change.\n";
print "(\"quit\" terminates the program)\n";
print "New expression: ";
$input = <STDIN>;
chomp $input;
if ($input =~ /^q.*$/i) {exit};
if ($input !~ /^$/) {
$regex = $input;
}
$stripped_regex = substr ($regex, 1, length ($regex) -2);
An Even Nicer Implementation of This Idea - III
# User includes the two slashes for a regular expression
# but they are stripped here so that variable is just the pattern
# that will be interpolated in /pattern/ context.
while (1) { # inner loop to sequence strings to test the expression
print "\nCurrent test string: $string\n";
print "Enter a new test string to change or ENTER to reset the regex.\n";
print "New test string: ";
$input = <STDIN>;
chomp $input;
if ($input =~ /^$/) { # for blank line, go back to set expression
last; }
else {
$string = $input; # else run regex over input
}
An Even Nicer Implementation of This Idea - IV
if( $string =~ /$stripped_regex/) {
print("$`\[$&]$'\n"); } # show match in context of input
else {
print("no match\n");
}
}
}
exit;
Finding the position of matches
If we use the global modifier g, then pos($string) returns the (0-based) position just after the match, so pos($string)-1 is the index of the last matched character:
$string = "ATCGCATGGAA";
$string =~ /T.G/g;
print "$& ends at position ", pos($string)-1, "\n";
$string =~ /T.G/g;
print "$& ends at position ", pos($string)-1, "\n";
Output:
TCG ends at position 3
TGG ends at position 8
Regular Expression (Where Did the Match Occur?)
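Rather than repeating the match statement by hand, a while loop can collect every match along with where it starts and ends. This sketch uses the built-in @- and @+ arrays, whose first elements hold the start offset and the offset just past the end of the last match; the helper name match_positions is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [text, start, end] triples
# (0-based, end inclusive) for every match of $pattern in $s.
sub match_positions {
    my ($s, $pattern) = @_;
    my @hits;
    while ($s =~ /$pattern/g) {
        my ($start, $after) = ($-[0], $+[0]);   # $+[0] equals pos($s) here
        push @hits, [ substr($s, $start, $after - $start), $start, $after - 1 ];
    }
    return @hits;
}

for my $hit (match_positions('ATCGCATGGAA', 'T.G')) {
    my ($text, $start, $end) = @$hit;
    print "$text starts at position $start and ends at position $end\n";
}
```

For the string above this reports the same end positions as the two hand-written matches, plus the start of each match.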
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Several rapidly developing RNA interference (RNAi)
methodologies hold the promise to selectively inhibit gene expression in
mammals. RNAi is an innate cellular process activated when a
double-stranded RNA (dsRNA) molecule of greater than 19 duplex
nucleotides enters the cell, causing the degradation of not only the
invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of
identical sequences, including endogenous mRNAs.";

# find all whitespace-delimited tokens with "RNA" strictly inside
while ( $string =~ /(\S+RNA\S+)/g ) {
    print "$1 ends at position ", pos($string)-1, "\n";
}
exit;

Output:
(RNAi) ends at position 49
(dsRNA) ends at position 211
(ssRNAs) ends at position 374
mRNAs. ends at position 431
Regular Expression (Where Did the Match Occur?)
Some Useful URLs
  http://docs.python.org/library/re.html
  http://www.regular-expressions.info/
  http://www.regular-expressions.info/tutorial.html
  http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
Additional Reading
Homework
  Remember we meet Tuesday 10/13/15 at the usual place and time due to the Columbus Day holiday.
  Program 2 due Tuesday 10/13/15 at 7:00 pm.
  Quiz 3 will occur next week.
  Remember that on Tuesday October 19, 2015 we will have our in-class midterm exam. It will be open book and notes.
On the Horizon
Regular Expression Lab Counts as a quiz grade
100 possible points
Our Regular Expression Lab