An Introduction to Perl with Applications in Web Page Scraping

What is Perl?

Practical Extraction and Report Language
High level
General purpose
Interpreted, dynamic programming language
Borrows from Unix shell scripting languages
Ideal for "small" tasks which involve text processing

What is going to be taught during this workshop?

Most of this presentation is taken from the www.perl.com introduction.

Perl language constructs
  Variables
  Flow control
  String processing
  File I/O
  Subroutines
  Object oriented Perl

Application: Web page scraping

Hello World

> perl -e 'print "hello world\n"'
hello world

> perl -e 'print "hello ", "world\n"'
hello world

> perl -e "print 'hello ', 'world\n'"
hello world\n>

Scalars

Single things
  Number
  String

$fruitCount = 5;
$fruitType = 'apples';
$countReport = "> There are $fruitCount $fruitType";
print $countReport;

> There are 5 apples

Scalars continued

$a = "8";
$b = $a + "1";   # + is numeric addition: both strings are converted to numbers
print "> $b\n";

> 9

$c = $a . "1";   # . is string concatenation
print "> $c\n";

> 81

Even more scalar examples*

$a = 5;
$a++;       # $a is now 6; we added 1 to it.
$a += 10;   # Now it's 16; we added 10.
$a /= 2;    # And divided it by 2, so it's 8.

*Shamelessly taken from http://www.perl.com/pub/a/2000/10/begperl1.html.

Arrays

Lists of scalars

@months = ("July", "August", "September");
print $months[0];        # This prints "July".
$months[2] = "Smarch";

If an array doesn't exist, you'll create it when you try to assign a value to one of its elements.

$winterMonths[0] = "December";   # This implicitly creates @winterMonths.

Arrays continued

If you want to find the last index of an array, use:

print "> $#months\n";

> 2

If the array is empty or doesn't exist, -1 is returned.

You can also resize an array:

$#months = 0;   # Now @months only contains "July"
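As a small supplementary sketch (the extra variable names are illustrative and do not come from the original slides), the last index $#months is always one less than the element count that the array yields in scalar context:

```perl
use strict;
use warnings;

my @months = ("July", "August", "September");

my $last_index = $#months;          # index of the last element
my $count      = scalar(@months);   # number of elements

print "> last index: $last_index, count: $count\n";

my @empty      = ();
my $empty_last = $#empty;           # -1 for an empty array

$#months = 0;                       # truncate to a single element
my $count_after = scalar(@months);
print "> after resize: @months ($count_after element)\n";
```

Assigning to $#months is a quick way to truncate, but any elements past the new last index are discarded.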

Hashes

Map a key to a value

%daysInMonth = ("July" => 31, "August" => 31, "September" => 30);
print "> $daysInMonth{'September'}\n";

> 30

To add a new key and value:

$daysInMonth{"February"} = 28;

Hashes continued

Counting the keys (in scalar context, keys returns the number of keys)

print "> " . keys(%daysInMonth) . "\n";   # 3 for the original three-month hash

> 3
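To get at the keys themselves rather than their count, keys can be used in list context. The following is a minimal sketch that reuses the %daysInMonth hash from the slides; the @names and $howMany variables are illustrative additions:

```perl
use strict;
use warnings;

my %daysInMonth = ("July" => 31, "August" => 31, "September" => 30);

# In list context, keys returns the keys themselves (in no particular order),
# so we sort them for a predictable listing.
my @names = sort keys %daysInMonth;

for my $month (@names) {
    print "> $month has $daysInMonth{$month} days\n";
}

# In scalar context, keys returns how many keys there are.
my $howMany = keys %daysInMonth;
print "> $howMany months\n";
```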

For loops

print "> ";
for ($i = 0; $i <= 5; $i++) {
    print "$i ";
}
print "\n";

> 0 1 2 3 4 5

For loops

Iterating over a list

print "> ";
for $i (5, 4, 3, 2, 1) {
    print "$i ";
}
print "\n";

> 5 4 3 2 1

For loops continued

@one_to_ten = (1 .. 10);
$top_limit = 25;

# Prints 1 through 10, then 15, then 20 through 25, one number per line.
for $i (@one_to_ten, 15, 20 .. $top_limit) {
    print "$i\n";
}

One more for loop

for $marx ('Groucho', 'Harpo', 'Zeppo', 'Karl') {
    print "> $marx is my favorite Marx brother.\n";
}

> Groucho is my favorite Marx brother.
> Harpo is my favorite Marx brother.
> Zeppo is my favorite Marx brother.
> Karl is my favorite Marx brother.

While loop

my $count = 0;
print "> ";
while ($count != 3) {
    $count++;
    print "$count ";
}
print "\n";

> 1 2 3

Until loop

$count = 3;
print "> ";
until ($count == 0) {
    $count--;
    print "$count ";
}
print "\n";

> 2 1 0

if/elsif/else

if ($a == 5) {
    print "It's five!\n";
} elsif ($a == 6) {
    print "It's six!\n";
} else {
    print "It's something else.\n";
}

Unless

unless ($pie eq 'apple') {
    print "Ew, I don't like $pie flavored pie.\n";
} else {
    print "Apple! My favorite!\n";
}

Comparing unless and if

print "I'm burning the 7 pm oil\n" unless $day eq 'Friday';
print "I'm burning the 7 pm oil\n" if not ($day eq 'Friday');

String operations

$yes_no = 'no';
print "> affirmative\n" if $yes_no == 'yes';

> affirmative

Strings are automatically converted to numbers for operations like '=='. Here both 'no' and 'yes' convert to 0, so the test succeeds.
Use eq instead of == for this to work correctly.
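The difference can be seen side by side in a minimal sketch. The $string_match and $numeric_match variables are illustrative, and the numeric warning is switched off only so the demonstration stays quiet:

```perl
use strict;
use warnings;

my $yes_no = 'no';

# eq compares the two values as strings, character by character.
my $string_match = ($yes_no eq 'yes') ? 1 : 0;

# == compares them as numbers; a non-numeric string counts as 0.
my $numeric_match;
{
    no warnings 'numeric';   # silence the "isn't numeric" warning for the demo
    $numeric_match = ($yes_no == 'yes') ? 1 : 0;
}

print "> eq says: ", ($string_match  ? "affirmative" : "negative"), "\n";
print "> == says: ", ($numeric_match ? "affirmative" : "negative"), "\n";
```

Running scripts with warnings enabled is a cheap way to catch an accidental == on strings, since Perl then reports that the argument isn't numeric.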

More string comparisons

my $five = 5;
print "> Numeric equality!\n" if $five == " 5 ";
print "> String equality!\n" if $five eq "5";

> Numeric equality!
> String equality!

print "> No string equality!\n" if not($five eq " 5");

> No string equality!

substr

$greeting = "Welcome to Perl!\n";

print "> " . substr($greeting, 0, 7) . "\n";

> Welcome

print ">", substr($greeting, 7);   # the substring starts with a space and ends with "\n"

> to Perl!

print "> ", substr($greeting, -6, 6), ">";   # the last six characters include the "\n"

> Perl!
>

substr continued

my $greeting = "Welcome to Java!\n";

substr($greeting, 11, 4) = 'Perl';
# $greeting is now "Welcome to Perl!\n";

substr($greeting, 7, 3) = '';
# ... "Welcome Perl!\n";

substr($greeting, 0, 0) = 'Hello. ';
# ... "Hello. Welcome Perl!\n";

split

my $greeting = "Hello. Welcome Perl!\n";

my @words = split(/ /, $greeting);
# Three items: "Hello.", "Welcome", "Perl!\n"

# An optional third argument limits the number of items returned.
@words = split(/ /, $greeting, 2);
# Two items: "Hello.", "Welcome Perl!\n";

join

my @words = ("Hello.", "Welcome", "Perl!\n");

my $greeting = join(' ', @words);
# "Hello. Welcome Perl!\n";

my $andy_greeting = join(' and ', @words);
# "Hello. and Welcome and Perl!\n";

my $jam_greeting = join('', @words);
# "Hello.WelcomePerl!\n";

Reading from a file

Contents of test.txt:

This
is
a
test

Reading from a file continued

open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";

# Read one line at a time; each line still ends with its newline.
while (my $line = <$testfile>) {
    print "> ", $line;
}

> This
> is
> a
> test

chomp

open my $testfile, '<', 'test.txt' or die "I couldn't get at test.txt: $!";

print "> ";
while (my $line = <$testfile>) {
    chomp $line;   # remove the trailing newline, if there is one
    print "$line ";
}
print "\n";

> This is a test

Writing to a file

open my $overwrite, '>', 'overwrite.txt' or die "error trying to overwrite: $!";
# Wave goodbye to the original contents.

open my $append, '>>', 'append.txt' or die "error trying to append: $!";
# Original contents still there; add to the end of the file
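The slide opens the filehandles but does not show a write. The following sketch, under the assumption that a throwaway temporary directory is acceptable (the path and the line contents are made up for illustration), prints to a handle in each mode and reads the result back:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch location so the sketch does not touch real files.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/append.txt";

# '>' creates the file (or truncates an existing one).
open my $overwrite, '>', $path or die "error trying to overwrite: $!";
print $overwrite "first line\n";
close $overwrite or die "error closing $path: $!";

# '>>' keeps the existing contents and writes at the end.
open my $append, '>>', $path or die "error trying to append: $!";
print $append "second line\n";
close $append or die "error closing $path: $!";

# Read the file back to confirm both lines are there.
open my $in, '<', $path or die "error trying to read: $!";
my @lines = <$in>;
close $in;

print "> $_" for @lines;
```

Note that there is no comma between the filehandle and the text in print; closing the handle flushes the buffered output to disk.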

Subroutines

sub multiply {
    my (@ops) = @_;   # arguments arrive in the special array @_
    my $ret = 1;
    for my $val (@ops) {
        $ret *= $val;
    }
    return $ret;
}

print "> ", multiply(2 .. 5), "\n";

> 120

Programming with objects

An object is a programmer-defined data structure which encapsulates
  Data
  Behavior (methods)

A web browser object may have
  Data
    The current page
    A history of recently visited URLs
  Behavior
    Can navigate to a page
    Can display a page
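The browser example above can be sketched in Perl roughly as follows. This is an illustration only: the Browser package, its field names, and its methods are hypothetical and are not taken from the original slides or from any library.

```perl
use strict;
use warnings;

package Browser;   # hypothetical class name, for illustration only

# Constructor: the object is a hash reference blessed into the class.
sub new {
    my ($class) = @_;
    my $self = { current_page => undef, history => [] };
    return bless $self, $class;
}

# Behavior: navigate to a page, remembering it in the history.
sub navigate {
    my ($self, $url) = @_;
    $self->{current_page} = $url;
    push @{ $self->{history} }, $url;
    return $self;
}

# Behavior: "display" the page (here, just describe it as a string).
sub display {
    my ($self) = @_;
    return "Showing " . ($self->{current_page} // 'nothing');
}

package main;

my $browser = Browser->new;
$browser->navigate('http://www.perl.com/');
$browser->navigate('http://search.cpan.org/dist/WWW-Mechanize/');

print "> ", $browser->display, "\n";
print "> visited ", scalar @{ $browser->{history} }, " pages\n";
```

The data lives in the blessed hash and the behavior lives in the package's subroutines; the arrow syntax passes the object as the first argument of each method.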

An Application: Scraping Web Pages
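The transcript carries no code for this slide. As a rough sketch of the idea using only core Perl, the following pulls link targets and link text out of a page with a regular expression. The HTML is a made-up in-memory string so the example runs offline; fetching a real page is left to a library such as the WWW::Mechanize module listed in the references, which this sketch deliberately does not use. A regular expression this simple is brittle and only handles plain double-quoted href attributes.

```perl
use strict;
use warnings;

# Hypothetical page content; a real scraper would download this instead.
my $html = <<'HTML';
<html><body>
<a href="http://www.perl.com/">Perl.com</a>
<a href="http://search.cpan.org/">CPAN search</a>
</body></html>
HTML

# Collect (url, text) pairs for every simple <a href="..."> link.
my @links;
while ($html =~ m{<a\s+href="([^"]+)"[^>]*>(.*?)</a>}gis) {
    push @links, { url => $1, text => $2 };
}

for my $link (@links) {
    print "> $link->{text}: $link->{url}\n";
}
```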

References

Beginner's Introduction to Perl
http://www.perl.com/pub/a/2000/10/begperl1.html

Perl Mechanize Library Documentation
http://search.cpan.org/dist/WWW-Mechanize/

Schwartz, R.L. and Phoenix, T., Learning Perl, 3rd Edition, November 1993.