An Introduction to Perl with Applications in Web Page Scraping
What is Perl?
Practical Extraction and Report Language
High level
General purpose
Interpreted, dynamic programming language
Borrows from Unix shell scripting languages
Ideal for "small" tasks which involve text processing
What is going to be taught during this workshop?
Most of this presentation is taken from the www.perl.com introduction
Perl language constructs
Variables
Flow control
String processing
File I/O
Subroutines
Object oriented Perl
Application: Web page scraping
Hello World
> perl -e 'print "hello world\n"'
hello world
> perl -e 'print "hello ", "world\n"'
hello world
> perl -e "print 'hello ', 'world\n'"
hello world\n>
Scalars
Single things
Number
String
$fruitCount = 5;
$fruitType = 'apples';
$countReport = "> There are $fruitCount $fruitType";
print $countReport;
> There are 5 apples
Scalars continued
$a = "8";
$b = $a + "1";
print "> $b\n";
> 9
$c = $a . "1";
print "> $c\n";
> 81
*Shamelessly taken from http://www.perl.com/pub/a/2000/10/begperl1.html.
Even more scalar examples
$a = 5;
$a++; # $a is now 6; we added 1 to it.
$a += 10; # Now it's 16; we added 10.
$a /= 2; # And divided it by 2, so it's 8.
Arrays
Lists of scalars
@months = ("July", "August", "September");
print $months[0]; # This prints "July".
$months[2] = "Smarch";
If an array doesn't exist, you'll create it when you try to assign a value to one of its elements.
$winterMonths[0] = "December"; # This implicitly creates @winterMonths.
Arrays continued
If you want to find the last index of an array, use:
print "> $#months\n";
> 2
If the array is empty or doesn't exist, -1 is returned.
You can also resize an array:
$#months = 0; # Now @months only contains "July"
Hashes
Map a key to a value
%daysInMonth = ( "July" => 31, "August" => 31, "September" => 30 );
print "> $daysInMonth{'September'}\n";
> 30
To add a new key and value:
$daysInMonth{"February"} = 28;
Hashes continued
Counting the keys of the original three-entry hash (in scalar context, keys returns the number of keys)
print "> " . keys(%daysInMonth) . "\n";
> 3
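The slide counts the keys; listing them works much the same way. The following is a minimal sketch, assuming the three-month hash from the earlier slide. The variable names $count and $month are introduced here only for illustration.

```perl
use strict;
use warnings;

my %daysInMonth = ( "July" => 31, "August" => 31, "September" => 30 );

# In scalar context, keys returns how many keys the hash holds.
my $count = keys %daysInMonth;
print "> $count months\n";

# In list context, keys returns the keys themselves in no fixed order,
# so sort them to get a repeatable listing.
for my $month (sort keys %daysInMonth) {
    print "> $month has $daysInMonth{$month} days\n";
}
```

Sorting is optional, but without it the order of the months can change from one run to the next.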
For loops
print "> ";
for ($i = 0; $i <= 5; $i++)
{
print "$i ";
}
print "\n";
> 0 1 2 3 4 5
For loops
Iterating over a list
print "> ";
for $i (5, 4, 3, 2, 1) {
print "$i ";
}
print "\n";
> 5 4 3 2 1
For loops continued
@one_to_ten = (1 .. 10);
$top_limit = 25;
for $i (@one_to_ten, 15, 20 .. $top_limit) {
print "$i\n";
}
One more for loop
for $marx ('Groucho', 'Harpo', 'Zeppo', 'Karl') {
print "> $marx is my favorite Marx brother.\n";
}
> Groucho is my favorite Marx brother.
> Harpo is my favorite Marx brother.
> Zeppo is my favorite Marx brother.
> Karl is my favorite Marx brother.
While loop
my $count = 0;
print "> ";
while ($count != 3) {
$count++;
print "$count ";
}
print "\n";
> 1 2 3
Until loop
$count = 3;
print "> ";
until ($count == 0) {
$count--;
print "$count ";
}
print "\n";
> 2 1 0
if/elsif/else
if ($a == 5) {
print "It's five!\n";
} elsif ($a == 6) {
print "It's six!\n";
} else {
print "It's something else.\n";
}
Unless
unless ($pie eq 'apple') {
print "Ew, I don't like $pie flavored pie.\n";
} else {
print "Apple! My favorite!\n";
}
Comparing unless and if
print "I'm burning the 7 pm oil\n" unless $day eq 'Friday';
print "I'm burning the 7 pm oil\n" if not ($day eq 'Friday');
String operations
$yes_no = 'no';
print "> affirmative\n" if $yes_no == 'yes';
> affirmative
Strings are automatically converted to numbers for operations like '=='. Both 'no' and 'yes' convert to 0, so the comparison is true.
Use eq instead of == for this to work correctly.
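The difference between the two operators can be seen side by side. This is a small sketch rather than part of the original slides; same_number and same_string are hypothetical helper names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compares its arguments as numbers.
sub same_number {
    my ($x, $y) = @_;
    no warnings 'numeric';   # silence "isn't numeric" warnings for this demo
    return $x == $y ? 1 : 0;
}

# Hypothetical helper: compares its arguments as strings.
sub same_string {
    my ($x, $y) = @_;
    return $x eq $y ? 1 : 0;
}

# 'no' and 'yes' both become 0 as numbers, so == reports a match.
print "> == says: ", same_number('no', 'yes'), "\n";
# As strings they differ, so eq reports no match.
print "> eq says: ", same_string('no', 'yes'), "\n";
```

With warnings left on, Perl flags the numeric comparison of non-numeric strings, which is a handy way to catch this mistake.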
More string comparisons
my $five = 5;
print "> Numeric equality!\n" if $five == " 5 ";
print "> String equality!\n" if $five eq "5";
> Numeric equality!
> String equality!
print "> No string equality!\n" if not($five eq " 5");
> No string equality!
substr
$greeting = "Welcome to Perl!\n";
print "> " . substr($greeting, 0, 7) . "\n";
> Welcome
print "> ", substr($greeting, 7); # the substring keeps its leading space and trailing newline
>  to Perl!
print "> ", substr($greeting, -6, 6), ">";
> Perl!
>
substr continued
my $greeting = "Welcome to Java!\n";
substr($greeting, 11, 4) = 'Perl';
# $greeting is now "Welcome to Perl!\n";
substr($greeting, 7, 3) = '';
# ... "Welcome Perl!\n";
substr($greeting, 0, 0) = 'Hello. ';
# ... "Hello. Welcome Perl!\n";
split
my $greeting = "Hello. Welcome Perl!\n";
my @words = split(/ /, $greeting);
# Three items: "Hello.", "Welcome", "Perl!\n"
@words = split(/ /, $greeting, 2); # the third argument limits the number of items
# Two items: "Hello.", "Welcome Perl!\n"
join
my @words = ("Hello.", "Welcome", "Perl!\n");
my $greeting = join(' ', @words);
# "Hello. Welcome Perl!\n";
my $andy_greeting = join(' and ', @words);
# "Hello. and Welcome and Perl!\n";
my $jam_greeting = join('', @words);
# "Hello.WelcomePerl!\n";
Reading from a file
Contents of test.txt:
This
is
a
test
Reading from a file continued
open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";
while (my $line = <$testfile>) {
print "> ", $line;
}
> This
> is
> a
> test
chomp
open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";
print "> ";
while (my $line = <$testfile>) {
chomp($line); # remove the trailing newline
print "$line ";
}
print "\n";
> This is a test
Writing to a file
open my $overwrite, '>', 'overwrite.txt' or die "error trying to overwrite: $!";
# Wave goodbye to the original contents.
open my $append, '>>', 'append.txt' or die "error trying to append: $!";
# Original contents still there; add to the end of the file
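The two open modes can be tried end to end by writing a file and reading it back. The sketch below is an illustration under stated assumptions, not the workshop's own code: write_lines and read_lines are hypothetical helpers, and the scratch file comes from the core File::Temp module so that no real file is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demo; it is deleted when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Hypothetical helper: open $file in the given mode ('>' or '>>')
# and write each item on its own line.
sub write_lines {
    my ($mode, $file, @lines) = @_;
    open my $out, $mode, $file or die "can't open $file: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "can't close $file: $!";
}

# Hypothetical helper: return the lines of $file without their newlines.
sub read_lines {
    my ($file) = @_;
    open my $in, '<', $file or die "can't open $file: $!";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

write_lines('>',  $path, 'This', 'is');   # replaces any earlier contents
write_lines('>>', $path, 'a', 'test');    # adds to the end of the file
print "> ", join(' ', read_lines($path)), "\n";
```

Closing the handle before reading matters here, because buffered output may not reach the file until the handle is closed.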
Subroutines
sub multiply {
my (@ops) = @_;
my $ret = 1;
for $val (@ops) {
$ret *= $val;
}
return $ret;
}
print "> ",multiply(2 .. 5), "\n";
> 120
Programming with objects
An object is a programmer-defined data structure which encapsulates
Data
Behavior (methods)
A web browser object may have
Data: the current page, a history of recently visited URLs
Behavior: can navigate to a page, can display a page
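The browser description above might be sketched as a small Perl class. Everything here is an assumption made for illustration: the package name Browser, its methods, and the URLs are hypothetical, and nothing is actually downloaded.

```perl
use strict;
use warnings;

package Browser;   # hypothetical class name, for illustration only

# Constructor: the object's data lives in a blessed hash reference.
sub new {
    my ($class) = @_;
    my $self = { current_page => undef, history => [] };
    return bless $self, $class;
}

# Behavior: record the URL in the history and make it the current page.
sub navigate {
    my ($self, $url) = @_;
    push @{ $self->{history} }, $url;
    $self->{current_page} = $url;
    return $self;
}

# Behavior: describe what would be shown (no real rendering happens).
sub display {
    my ($self) = @_;
    return defined $self->{current_page}
        ? "Showing $self->{current_page}"
        : "No page loaded";
}

# Accessor: the list of visited URLs, oldest first.
sub history {
    my ($self) = @_;
    return @{ $self->{history} };
}

package main;

my $browser = Browser->new;
$browser->navigate('http://www.perl.com/');
$browser->navigate('http://search.cpan.org/');
print "> ", $browser->display, "\n";
print "> visited: ", join(', ', $browser->history), "\n";
```

The data sits in the hash and the behavior in the subroutines; bless is what ties the two together so that method calls find the right package.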
An Application: Scraping Web Pages
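The transcript does not preserve the scraping code shown in the workshop, so the following is only a simplified sketch of the idea using the string tools from the earlier slides. It assumes the page has already been downloaded into a string; the HTML content and the @links structure are hypothetical. A regular expression like this one handles only simple, well-formed links, and a dedicated library such as the WWW::Mechanize module listed in the references is the more robust choice for real pages.

```perl
use strict;
use warnings;

# Hypothetical page content; a real scraper would download this first.
my $html = <<'HTML';
<html><body>
<a href="http://www.perl.com/">Perl.com</a>
<a href="http://search.cpan.org/">CPAN search</a>
</body></html>
HTML

# Collect each link's URL and text with a global, case-insensitive match.
my @links;
while ($html =~ /<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/gi) {
    push @links, { url => $1, text => $2 };
}

print "> $_->{text}: $_->{url}\n" for @links;
```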
References
Beginner's Introduction to Perl
http://www.perl.com/pub/a/2000/10/begperl1.html
Perl Mechanize Library Documentation
http://search.cpan.org/dist/WWW-Mechanize/
Schwartz, R. L. and Phoenix, T., Learning Perl, 3rd Edition, November 1993.