Bioinformatica 27-10-2011-p4-files
![Page 2: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/2.jpg)
FBW29-10-2008
Wim Van Criekinge
![Page 3: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/3.jpg)
Programming
• Variables
• Flow control (if, regex …)
• Loops
• input/output
• Subroutines/object
![Page 4: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/4.jpg)
Three Basic Data Types
• Scalars - $
• Arrays of scalars - @
• Associative arrays of scalars (hashes) - %
![Page 5: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/5.jpg)
• [m]/PATTERN/[g][i][o]
• s/PATTERN/PATTERN/[g][i][e][o]
• tr/PATTERNLIST/PATTERNLIST/[c][d][s]
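The three operators above (match, substitute, transliterate) can be tried on a short DNA string. This is a minimal sketch; the sub names and the example sequence are made up for illustration and are not part of the course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# m//g in list context returns every non-overlapping match,
# so the size of the list is the number of occurrences.
sub count_motif {
    my ($seq, $motif) = @_;
    my @hits = $seq =~ /\Q$motif\E/gi;
    return scalar @hits;
}

# s///g replaces every T by U (DNA to RNA).
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;
    return $rna;
}

# tr/// maps characters one to one (A<->T, C<->G).
sub complement {
    my ($dna) = @_;
    (my $comp = $dna) =~ tr/ACGT/TGCA/;
    return $comp;
}

my $dna = "ATGGATCCGATTAA";   # hypothetical example sequence
print "GAT occurs ", count_motif($dna, "GAT"), " times\n";
print "RNA:        ", transcribe($dna), "\n";
print "Complement: ", complement($dna), "\n";
```

Note that tr/// works on a list of characters, not on a regular expression, which is why it suits complementing a sequence.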
![Page 6: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/6.jpg)
The ‘structure’ of a Hash
• An array looks something like this:
@array =
Index:       0        1        2
Value:       'val1'   'val2'   'val3'
• A hash looks something like this:
%phone =
Key (name):  Rob        Matt       Joe_A
Value:       353-7236   353-7122   555-1212
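The two structures pictured above could be written in Perl roughly as follows. The lookup sub and its 'unknown' fallback are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An array is indexed by position, a hash by key.
my @array = ('val1', 'val2', 'val3');
my %phone = (
    Rob   => '353-7236',
    Matt  => '353-7122',
    Joe_A => '555-1212',
);

# Hypothetical helper: return the number for a name, or 'unknown'.
sub lookup {
    my ($book, $name) = @_;
    return exists $book->{$name} ? $book->{$name} : 'unknown';
}

print "Element 1 of the array: $array[1]\n";
print "Rob's number: ", lookup(\%phone, 'Rob'), "\n";
print "Number of entries: ", scalar(keys %phone), "\n";
```

A single element is always a scalar, so it is written with $ in both cases: $array[1] with square brackets, $phone{Rob} with curly braces.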
![Page 7: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/7.jpg)
• First, create a list of keys. Fortunately, there is a function for that:
  – keys %hash (returns a list of keys)
• Next, visit each key and print its associated value:
  foreach (keys %hash) {
      print "The key $_ has the value $hash{$_}\n";
  }
• One complication: hashes do not maintain any sort of order. In other words, if you put key/value pairs into a hash in a particular order, you will not get them out in that order!
Printing a hash (continued)
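Because hash order is not predictable, a common way to get reproducible output is to sort the keys first. A minimal sketch, with a made-up sub name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one line per key, in alphabetical key order.
sub format_hash {
    my ($hash) = @_;
    my @lines;
    foreach my $key (sort keys %$hash) {
        push @lines, "The key $key has the value $hash->{$key}";
    }
    return @lines;
}

my %phone = (Rob => '353-7236', Matt => '353-7122', Joe_A => '555-1212');
print "$_\n" for format_hash(\%phone);
```

sort without a comparison block sorts as strings; to order numeric keys, use sort { $a <=> $b } keys %$hash instead.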
![Page 8: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/8.jpg)
my %AA1 = ( 'UUU','F','UUC','F','UUA','L','UUG','L','UCU','S','UCC','S','UCA','S','UCG','S','UAU','Y','UAC','Y','UAA','*','UAG','*','UGU','C','UGC','C','UGA','*','UGG','W',
'CUU','L','CUC','L','CUA','L','CUG','L','CCU','P','CCC','P','CCA','P','CCG','P','CAU','H','CAC','H','CAA','Q','CAG','Q','CGU','R','CGC','R','CGA','R','CGG','R',
'AUU','I','AUC','I','AUA','I',
'AUG','M','ACU','T','ACC','T','ACA','T','ACG','T','AAU','N','AAC','N','AAA','K','AAG','K','AGU','S','AGC','S','AGA','R','AGG','R',
'GUU','V','GUC','V','GUA','V','GUG','V','GCU','A','GCC','A','GCA','A','GCG','A','GAU','D','GAC','D','GAA','E','GAG','E','GGU','G','GGC','G','GGA','G',
'GGG','G' );
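A codon table like %AA1 is typically used by cutting an RNA string into triplets and looking each one up. The sketch below takes the table as a parameter, so \%AA1 could be passed in. Returning 'X' for an unknown codon is an assumption of this example, and the small table here is only a three-codon subset.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Translate an RNA string codon by codon using a codon => amino acid table.
# Incomplete trailing codons are ignored; unknown codons become 'X'.
sub translate {
    my ($rna, $table) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($rna); $pos += 3) {
        my $codon = uc substr($rna, $pos, 3);
        $protein .= exists $table->{$codon} ? $table->{$codon} : 'X';
    }
    return $protein;
}

# Subset of the table, enough for the example sequence.
my %subset = ('AUG' => 'M', 'GCU' => 'A', 'UAA' => '*');
print translate("AUGGCUUAA", \%subset), "\n";
```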
![Page 9: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/9.jpg)
• There is more than one right way to do it. Unfortunately, there are also many wrong ways.
  – 1. Always check that the output is correct and logical.
    • Consider what errors might occur, and take steps to ensure that you are accounting for them.
  – 2. Check that you are using every variable you declare.
    • Use strict! (use strict; makes Perl reject undeclared variables, which also catches misspelled names.)
  – 3. Always go back to a script once it is working and see if you can eliminate unnecessary steps.
    • Concise code is good code.
    • You will learn more if you optimize your code.
    • Concise does not mean comment-free. Please use as many comments as you think are necessary.
    • Sometimes you want to leave easy-to-understand code in, rather than short but difficult-to-understand tricks. Use your judgment.
    • Remember that in the future you may wish to use or alter the code you wrote today. If you don't understand it today, you won't tomorrow.
Programming in general and Perl in particular
![Page 10: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/10.jpg)
Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it.
When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested to), but a few words or a number change every few minutes will show that your program is doing something.
Comment your script. Any information on what it is doing or why might be useful to you a few months later.
Decide on a coding convention and stick to it. For example:
– for variable names, begin globals with a capital letter and privates (my) with a lower case letter
– indent new control structures with (say) 2 spaces
– line up closing braces, as in:
  if (....) {
    ...
    ...
  }
– add blank lines between sections to improve readability
Programming in general and Perl in particular
![Page 11: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/11.jpg)
>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTTTTCGTGCT
ATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA
![Page 12: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/12.jpg)
File input / output
Opening a filehandle
• In order to use a filehandle other than STDIN, STDOUT and STDERR, the filehandle needs to be opened. The open function opens a file or device and associates it with a filehandle.
• It returns a true (non-zero) value upon success and undef otherwise.
Examples:
# open a filehandle for reading:
open (SOURCE_FILE, "filename");
# or
open (SOURCE_FILE, "<filename");
# open a filehandle for writing:
open (RESULT_FILE, ">filename");
# open a filehandle for appending:
open (LOGFILE, ">>filename");
![Page 13: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/13.jpg)
File input / output
Closing a filehandle
• When you are finished with a filehandle, you may close it with the close function. The close function closes the file or device associated with the filehandle.
Example:
close (MY_FILE_HANDLE);
Filehandles are automatically closed when the program exits, or when the filehandle is reopened.
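The slides use bareword filehandles and the two-argument form of open. Many current Perl scripts prefer the three-argument form with a lexical filehandle, which keeps the mode separate from the file name. The sketch below shows that variant as an alternative, not as the course's own style; the sub names and the scratch file name are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write lines to a file. $mode is '>' (overwrite) or '>>' (append).
sub write_lines {
    my ($path, $mode, @lines) = @_;
    open(my $out, $mode, $path) or die "cannot open $path: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "cannot close $path: $!";
}

# Read a whole file and return its lines without trailing newlines.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

my $file = "open_demo_$$.txt";            # hypothetical scratch file
write_lines($file, '>',  "first line");   # create or truncate
write_lines($file, '>>', "second line");  # append
print "$_\n" for read_lines($file);
unlink $file;                             # clean up
```

A lexical filehandle such as $out is closed automatically when it goes out of scope, but closing it explicitly makes write errors visible.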
![Page 14: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/14.jpg)
File input / output
The die function
• Sometimes the open function fails. For example, opening a file for input might fail because the file does not exist, and opening a file for output might fail because you do not have write permission for it. A Perl program will nevertheless go on using the filehandle, and unless warnings are enabled it will not tell you that all input and output activities are actually meaningless.
• Therefore, it is recommended to explicitly check the result of the open command and, if it fails, to print an error message and exit the program.
• This is easily done using the die function.
Example:
my $k = open (FILEHANDLE, "filename");
unless ($k) {
    die ("cannot open file filename: $!");
}
# In case file "filename" cannot be opened,
# the argument of die will be printed on
# the screen and the program will exit.
# $! is a special variable that contains the respective
# error message sent by the operating system.
A shorthand:
open (FILEHANDLE, "filename") || die "cannot open file filename: $!";
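When the open call sits inside a subroutine, the caller can trap the die with eval and decide what to do next. In the sketch below the sub name and the missing file name are hypothetical. It uses or rather than ||: or binds more loosely, so the check still works if the parentheses around the arguments of open are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first line of a file, or die with the system error message.
sub first_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open file $path: $!\n";
    my $line = <$fh>;
    close($fh);
    chomp $line if defined $line;
    return $line;
}

my $missing = "no_such_file_$$.txt";   # hypothetical name, assumed not to exist
my $result  = eval { first_line($missing) };
if (defined $result) {
    print "first line: $result\n";
} else {
    print "caught: $@";                # $@ holds the message passed to die
}
```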
![Page 15: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/15.jpg)
Using filehandles for writing
Example:
#!/usr/local/bin/perl
use strict;
use warnings;

open (OUTF, ">out_file") || die "cannot open out_file: $!";
open (LOGF, ">>log_file") || die "cannot open log_file: $!";

print OUTF "Here is my program output\n";
print LOGF "First task of my program completed\n";
print "Nice, isn't it?\n";   # will be printed on the screen

close (OUTF);
close (LOGF);
![Page 16: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/16.jpg)
When <FILEHANDLE> is assigned into an array variable, all lines up to the end of the file are read at once. Each line becomes a separate element of the array.
#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = "CEACAM3.txt";
open (FH, $infile) || die "cannot open \"$infile\": $!";
my @lines = <FH>;
chomp (@lines);   # chomp each element of @lines
close (FH);

# to process the lines you might wish to iterate
# over the @lines array with a foreach loop:
my $line;
foreach $line (@lines) {
    # process $line. here we just print it.
    print "$line\n";
}
Using filehandles for reading (2/3)
![Page 17: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/17.jpg)
#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = "CEACAM3.txt";
my ($line1, $line2, $line3);

open (FH, $infile) || die "cannot open \"$infile\": $!";

$line1 = <FH>;   # read first line
print $line1;    # process line (here we only print it)
$line2 = <FH>;   # read next line
print $line2;    # process line (here we only print it)
$line3 = <FH>;   # read next line
print $line3;    # process line (here we only print it)

close (FH);
Using filehandles for reading (1/3)
![Page 18: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/18.jpg)
Using a while loop, read one line at a time and assign it to a scalar variable, for as long as the read succeeds. At end-of-file <FH> returns undef, which ends the loop. Perl implicitly tests while ($line = <FH>) for definedness, so a line containing only "0" does not end the loop early.
Note that a blank line read from the file is not an empty string, since it still contains the terminating \n.

#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = "CEACAM3.txt";
open (FH, $infile) || die "cannot open \"$infile\": $!";

my $line;                  # or, in one line:
while ($line = <FH>) {     # while (my $line = <FH>) {
    chomp ($line);
    print "$line\n";       # process line. here we just print it.
}

close (FH);
Using filehandles for reading (3/3)
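The line-by-line loop is the usual starting point for reading sequence files. As an illustration, the sketch below collects a multi-FASTA file into a hash of identifier => sequence. The sub name is made up, and the input comes from an in-memory string (opened through a scalar reference) so the example runs without a file on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA records from an open filehandle into a hash (id => sequence).
sub read_fasta {
    my ($fh) = @_;
    my (%seq, $id);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ($line =~ /^>(\S+)/) {          # header line: start a new record
            $id = $1;
            $seq{$id} = '';
        } elsif (defined $id) {            # sequence line: append to current record
            $seq{$id} .= $line;
        }
    }
    return %seq;
}

my $text = ">seq1\nACTG\nGGCC\n\n>seq2\nTTAA\n";   # hypothetical example data
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my %seq = read_fasta($fh);
close($fh);
print "$_: $seq{$_}\n" for sort keys %seq;
```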
![Page 19: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/19.jpg)
• Demo: Prosite Parser
![Page 20: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/20.jpg)
1. Swiss-Knife.pl
• Database
  – http://www.ebi.ac.uk/swissprot/FTP/ftp.html
  – How many entries are there?
  – Average protein length (in aa and MW)
  – Relative frequency of amino acids
    • Compare to the ones used to construct the PAM scoring matrices from 1978 and 1991
![Page 21: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/21.jpg)
Amino acid frequencies
     1978   1991
L    0.085  0.091
A    0.087  0.077
G    0.089  0.074
S    0.070  0.069
V    0.065  0.066
E    0.050  0.062
T    0.058  0.059
K    0.081  0.059
I    0.037  0.053
D    0.047  0.052
R    0.041  0.051
P    0.051  0.051
N    0.040  0.043
Q    0.038  0.041
F    0.040  0.040
Y    0.030  0.032
M    0.015  0.024
H    0.034  0.023
C    0.033  0.020
W    0.010  0.014
Second step: Frequencies of Occurrence
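Relative frequencies like those in the table can be obtained by counting each residue and dividing by the total. A minimal sketch follows; the sub name and the two toy sequences are invented, and ignoring any character outside the 20 standard one-letter codes is an assumption of this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a hash (amino acid => relative frequency) over all given sequences.
sub aa_frequencies {
    my (@proteins) = @_;
    my %count;
    my $total = 0;
    for my $protein (@proteins) {
        for my $aa (split //, uc $protein) {
            next unless $aa =~ /[ACDEFGHIKLMNPQRSTVWY]/;   # standard residues only
            $count{$aa}++;
            $total++;
        }
    }
    return () unless $total;                               # nothing to count
    return map { $_ => $count{$_} / $total } keys %count;
}

my %freq = aa_frequencies("MKLLA", "GAVLK");               # toy input
for my $aa (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    printf "%s %.3f\n", $aa, $freq{$aa};
}
```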
![Page 22: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/22.jpg)
Parser.pl
#! C:\Perl\bin\perl.exe -w
# (Do not forget to adapt the path to perl.exe above to its location on your own computer)

# Example of using substrings and files
# in a parser for sequence information records

use strict;
use warnings;

my ($sp_file,$line,$id,$ac,$de);

$sp_file = "sp.txt";
open (SP,$sp_file) || die "cannot open \"$sp_file\":$!";

while ($line=<SP>){
    chomp($line);
    next if length($line) < 6;        # skip lines too short to hold a value (e.g. "//")

    my $field = substr ($line,0,2);   # two-letter line code
    my $value = substr ($line,5);     # the data starts at the sixth character

    if ($field eq "ID"){
        $id = $value;
    }
    if ($field eq "AC"){
        $ac = $value;
    }
    if ($field eq "DE"){
        $de = $value;
    }
}
close (SP);

print "Identification: $id\n";
print "Accession No.: $ac\n";
print "Description: $de\n";
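Parser.pl keeps only the last ID, AC and DE values it sees. If sp.txt holds several entries, each terminated by a line starting with "//", the values need to be stored per entry. The sketch below shows one possible way to do that; the sub name and the two sample records are invented for illustration and are not real Swiss-Prot entries.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the first ID, AC and DE line of every entry into a list of hashes.
sub parse_entries {
    my ($fh) = @_;
    my (@entries, %current);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m{^//}) {                 # end of entry
            push @entries, { %current } if %current;
            %current = ();
            next;
        }
        next if length($line) < 6;             # too short to hold a value
        my $field = substr($line, 0, 2);
        my $value = substr($line, 5);
        $current{$field} = $value
            if $field =~ /^(?:ID|AC|DE)$/ && !exists $current{$field};
    }
    push @entries, { %current } if %current;   # file without a final "//"
    return @entries;
}

# Made-up records in the same column layout as the parser expects.
my $records = join "\n",
    "ID   EXAMPLE_1   made-up entry",
    "AC   X00001;",
    "DE   First invented protein.",
    "//",
    "ID   EXAMPLE_2   made-up entry",
    "AC   X00002;",
    "DE   Second invented protein.",
    "//", "";

open(my $fh, '<', \$records) or die "cannot open in-memory file: $!";
my @entries = parse_entries($fh);
close($fh);

print "Number of entries: ", scalar(@entries), "\n";
print "$_->{AC} $_->{DE}\n" for @entries;
```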
![Page 23: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/23.jpg)
2. PAM-simulator.pl
– Check transition matrix with and without randomizing the rows of evolutions
– Adapt the program to simulate evolving DNA
– Adapt the program so it generates random proteins taking into account the relative frequencies found in step 1
– Write the output to a multi-fasta file:
>PAM1
AHFALKJHFDLKFJHALSKJFH
>PAM2
AHGALKJHFDLKFJHALSKJFH
>PAM3
AHGALKJHFDLKFJHALSKJFH
…..
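Writing the multi-FASTA output could look roughly like this. The sub name and the 60-character default line width are assumptions of the sketch; the two sequences are the placeholder strings from the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print one FASTA record to a filehandle, wrapping the sequence at $width characters.
sub write_fasta {
    my ($fh, $name, $sequence, $width) = @_;
    $width ||= 60;
    print {$fh} ">$name\n";
    for (my $pos = 0; $pos < length($sequence); $pos += $width) {
        print {$fh} substr($sequence, $pos, $width), "\n";
    }
}

my @generations = ("AHFALKJHFDLKFJHALSKJFH", "AHGALKJHFDLKFJHALSKJFH");
my $n = 0;
write_fasta(\*STDOUT, "PAM" . ++$n, $_) for @generations;
```

Passing the filehandle as an argument means the same sub can write to STDOUT while testing and to an opened output file later.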
![Page 24: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/24.jpg)
• Initialize:
  – Generate random protein (1000 aa)
• Simulate evolution (e.g. 250 steps for PAM250)
  – Apply PAM1 transition matrix to each amino acid
  – Use weighted random selection
• Iterate
  – Measure difference to original protein
Experiment: pam-simulator.pl
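For the "measure difference" step, one simple measure is the percentage of positions that still carry the original residue. This is only a sketch of that idea, under the assumption that both sequences have the same length because the simulation substitutes residues and never inserts or deletes them. The sub name is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of positions at which two equally long sequences agree.
sub percent_identity {
    my ($original, $evolved) = @_;
    die "sequences differ in length\n"
        unless length($original) == length($evolved);
    die "empty sequence\n" unless length($original);
    my $same = 0;
    for my $i (0 .. length($original) - 1) {
        $same++ if substr($original, $i, 1) eq substr($evolved, $i, 1);
    }
    return 100 * $same / length($original);
}

printf "%.1f%% identity\n", percent_identity("AHFALKHFDL", "AHGALKHFDV");
```

Calling this after every simulated PAM step and recording the result gives the data for a plot of % identity against PAM distance.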
![Page 25: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/25.jpg)
Dayhoff’s PAM1 mutation probability matrix (Transition Matrix)
        A     R     N     D     C     Q     E     G     H     I
      Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His   Ile
A    9867     2     9    10     3     8    17    21     2     6
R       1  9913     1     0     1    10     0     0    10     3
N       4     1  9822    36     0     4     6     6    21     3
D       6     0    42  9859     0     6    53     6     4     1
C       1     1     0     0  9973     0     0     0     1     1
Q       3     9     4     5     0  9876    27     1    23     1
E      10     0     7    56     0    35  9865     4     2     3
G      21     1    12    11     1     3     7  9935     1     0
H       1     8    18     3     1    20     1     0  9912     0
I       2     2     3     1     2     1     2     0     0  9872
![Page 26: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/26.jpg)
Weighted Random Selection
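• Ala => Xxx (%)
[Figure: chart of the substitution probabilities from Ala to each amino acid; categories A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V]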
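Weighted random selection can be sketched as follows: draw a random number, scale it to the sum of all weights, and walk through the items until the running total passes that number. The sub name is invented. The weights are the ten values in the column headed A of the PAM1 table, read here as "Ala is replaced by ...", which is an assumption about how the slide's table is oriented. The other ten amino acids are not shown on the slide and are left out, so the sub normalises by the sum of whatever weights it is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pick one key of %$weights with probability proportional to its weight.
# $r (optional) is a number in [0,1); it defaults to rand() and exists
# so that the selection can be tested with a fixed value.
sub weighted_pick {
    my ($weights, $r) = @_;
    my @items = sort keys %$weights;       # fixed order for reproducibility
    my $total = 0;
    $total += $weights->{$_} for @items;
    $r = rand() unless defined $r;
    my $threshold = $r * $total;
    my $running = 0;
    for my $item (@items) {
        $running += $weights->{$item};
        return $item if $threshold < $running;
    }
    return $items[-1];                     # guard against rounding at the upper end
}

# Partial column for Ala (10 of 20 amino acids, as on the slide).
my %from_ala = (A => 9867, R => 1, N => 4, D => 6, C => 1,
                Q => 3,    E => 10, G => 21, H => 1, I => 2);

my %seen;
$seen{ weighted_pick(\%from_ala) }++ for 1 .. 10000;
print "$_: $seen{$_}\n" for sort keys %seen;
```

With these weights almost every draw returns A, which matches the idea that one PAM step changes only about 1% of the positions.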
![Page 27: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/27.jpg)
PAM-Simulator
[Figure: PAM-simulator chart. x axis: PAM (0 to 300); y axis: % identity (0 to 120)]
![Page 28: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/28.jpg)
3. Palindromes
What is the longest palindrome in palin.fasta?
Why are restriction sites palindromic? How long is the longest palindrome in the genome?
Hints: http://www.man.poznan.pl/cmst/papers/5/art_2/vol5art2.html
Palingram.pl
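"Palindrome" can mean two things here. A text palindrome reads the same backwards, while a restriction site such as the EcoRI site GAATTC is palindromic in the sense that it equals its own reverse complement. The sketch below illustrates both notions with invented sub names. The text search expands around every possible centre, which is a different technique from the regular-expression approach of Palingram.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Longest substring that reads the same in both directions
# (expand around each of the 2n-1 possible centres).
sub longest_palindrome {
    my ($text) = @_;
    my @c = split //, $text;
    my ($best_start, $best_len) = (0, 0);
    for my $center (0 .. 2 * @c - 2) {
        my $left  = int($center / 2);
        my $right = $left + $center % 2;   # odd centres lie between two characters
        while ($left >= 0 && $right < @c && $c[$left] eq $c[$right]) {
            $left--;
            $right++;
        }
        my $len = $right - $left - 1;
        ($best_start, $best_len) = ($left + 1, $len) if $len > $best_len;
    }
    return substr($text, $best_start, $best_len);
}

# True if a DNA string equals its own reverse complement.
sub is_revcomp_palindrome {
    my ($dna) = @_;
    my $rc = scalar reverse uc($dna);
    $rc =~ tr/ACGT/TGCA/;
    return $rc eq uc($dna);
}

print longest_palindrome("trapoppartest"), "\n";
print "GAATTC is ", (is_revcomp_palindrome("GAATTC") ? "" : "not "),
      "a reverse-complement palindrome\n";
```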
![Page 29: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/29.jpg)
Palin.fasta
>palin.fasta
ATGGCTTATTTATTTGCCCACAAGAACTTAGGTGCATTGAAATCTAAAG
CTAATTGCTTATTTAGCTTTGCTTGGCCTTTTCACTTAAATAAAACATAGCATCAACTTCAGCAGGAATGGGTGCACATGCTGATCGAGGTGGAAGAAGGGCACATATGGCATCGGCATCCTTATGGCTAATTTTAAATGGAGAACTTTCTAAAGTCACGTTTTCACATGCAATATTCTTAACATTTTCAATTTTTTTTGTAACTAATTCTTCCCATCTACTATGTGTTTGCAAGACAATCTCAGTAGCAAACTCCTTATGCTTAGCCTCACCGTTAAAAGCAAACTTATTTGGGGGATCTCCACCAGGCATTTTATATATTTTGAACCACTCTACTGACGCGTTAGCTTCAAGTAAACCAGGCATCACTTCTTTTACGTCATCAATATCATTAAGCTTTGAAGCTAGAGGATCATTTACATCAATTGCTATTACTTAGCTTAGCCCTTCAAGTACTTGAAGGGCTAAGCTTCCAATCTGTTTCACCATTGTCAATCATAGCTAAGACACCCAGCAACTTAACTTGCAAAACAGATCCTCTTTCTGCAACTTTGTAACCTATCTCTATTACATCAACAGGATCACCATCACCAAATGCATTAGTGTGCTCATCAATAAGATTTGGATCCTCCCAAGTCTGTGGCAAAGCTCCATAATTCCAAGGATAACC
![Page 30: Bioinformatica 27-10-2011-p4-files](https://reader036.fdocuments.us/reader036/viewer/2022062704/5562feedd8b42a6f598b4e16/html5/thumbnails/30.jpg)
Palingram.pl
#!E:\perl\bin\perl -w
$line_input = "edellede parterretrap trap op sirenes en er is popart test";
$line_input =~ s/\s//g;                 # remove all whitespace
$l = length($line_input);

# try every start position in the input
for ($m = 0; $m <= $l-1; $m++) {
    $line = substr($line_input, $m);
    print "length=$m:$l\t".$line."\n";

    # build one pattern per candidate length (8 to 25 letters)
    for $n (8..25) {
        $re = qr /[a-z]{$n}/;
        print "pattern ($n) = $re\n";
        $regexes[$n-8] = $re;
    }

    foreach (@regexes) {
        while ($line =~ m/$_/g) {
            $endline = $';              # text after the match
            $match = $&;                # the matched text
            $all = $match.$endline;
            $revmatch = reverse($match);
            # the match is a palindrome if the text also starts with its reverse
            if ($all =~ /^($revmatch)/) {
                $palindrome = $revmatch . "*" . $1 ;
                $palhash{$palindrome}++;
            }
        }
    }
}

print "Set of palindromes\n";
while (($key, $value) = each (%palhash)) {
    print "$key => $value\n";
}