Perl regular expressions

This Powerpoint file can be found at:

http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10

Kansas City Area SAS User Group (KCASUG)

October 5, 2004

Larry Hoyle

Policy Research Institute, The University of Kansas

Regular expressions

• A regular expression is a pattern to be matched against some text (a string)

• originally from neurophysiology

• Then in QED and grep

• see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp

Perl regular expressions

• Practical Extraction and Report Language implements a version of regular expressions that is something of a standard

• see: http://www.perldoc.com/perl5.6.1/pod/perlre.html

SAS Documentation

Short syntax description

Some simple examples

/Baa/ matches the string "Baa"

/Baa\d/ matches "Baa" followed by any numeric digit

Using Perl Regular Expressions in SAS 9.1 and above

data cc;
  input c $;
  prxNum=prxParse('/Baa\d/');
  start=prxMatch(prxNum,c);
  if start then put c= 'is a match';
  else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaa
Baa3
;
run;

proc sql;
  select * from cc where prxmatch('/Baa\d/',c);
quit;
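As a quick cross-check of what /Baa\d/ should and should not match, here is a minimal sketch in Python's `re` module, used only as a stand-in for the SAS PRX functions. The helper name `match_position` is hypothetical, and the five test values are assumed to be the datalines of the example above. It returns a 1-based position, or 0 for no match, in the style of PRXMATCH.

```python
import re

# Same pattern as the SAS example: "Baa" followed by one digit.
pattern = re.compile(r"Baa\d")

def match_position(text):
    """Return the 1-based start of the first match, or 0 if there is none."""
    m = pattern.search(text)
    return m.start() + 1 if m else 0

# Assumed to be the five data lines of the SAS example.
values = ["Baa", "Baa2", "baa3", "aaaa", "Baa3"]
results = {v: match_position(v) for v in values}
```

Only "Baa2" and "Baa3" match; "baa3" fails because the pattern is case sensitive.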

Documentation for PRX Functions and Call Routines in SAS HELP

CALL PRXCHANGE: Performs a pattern-matching replacement
CALL PRXDEBUG: Enables Perl regular expressions in a DATA step to send debug output to the SAS log
CALL PRXFREE: Frees unneeded memory that was allocated for a Perl regular expression
CALL PRXNEXT: Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
CALL PRXPOSN: Returns the start position and length for a capture buffer
CALL PRXSUBSTR: Returns the position and length of a substring that matches a pattern
PRXCHANGE Function: Performs a pattern-matching replacement
PRXMATCH Function: Searches for a pattern match and returns the position at which the pattern is found
PRXPAREN Function: Returns the last bracket match for which there is a match in a pattern
PRXPARSE Function: Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
PRXPOSN Function: Returns the value for a capture buffer

single character "wildcards"

. matches any character

\d matches a numeric character

\D matches a non-numeric

\w matches a "word character" (letter, digit, or underscore)

\W matches a non-word character

\s matches white space (spaces, tabs, and other whitespace characters such as newlines)

\S matches non-white space

Try a different pattern for expr

data myturn;
  retain expr '/Whatever/'; /* put your own expression here */
  retain prxNum;
  length c $ 80;
  input c $80.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    if prxNum=0 then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,c);
  put start= c= ;
datalines;
Whatever floats your boat
Now is the time
for all-good
men 2
come to the aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;

find all the numbers

find the first space on each line

find any non word characters

sample expressions

find all the numbers /\d/

find the first space on each line /\s/

find any non word characters /\W/
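The three sample expressions can be tried on two of the practice lines. This is a sketch in Python's `re` module rather than SAS; `first_match` is a hypothetical helper that reports the 1-based position and text of the first match, or (0, "") when nothing matches.

```python
import re

# Two of the practice lines from the exercise.
lines = [
    "Now is the time",
    "The quick red fox jumped over the 3 lazy dogs",
]

def first_match(pattern, text):
    """Return (1-based position, matched text) of the first match, else (0, "")."""
    m = re.search(pattern, text)
    return (m.start() + 1, m.group()) if m else (0, "")

digit_hits = [first_match(r"\d", s) for s in lines]    # first digit
space_hits = [first_match(r"\s", s) for s in lines]    # first white space
nonword_hits = [first_match(r"\W", s) for s in lines]  # first non-word character
```

On these two lines /\s/ and /\W/ report the same position, because the first non-word character happens to be a space. Note that a single match reports only the first hit; finding every number in a line needs repeated matching (CALL PRXNEXT in the function list).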

Anchors

^ beginning of the string

$ end of the string

Character Classes

[acB] matches "a", "c" or "B"

[D-G] matches "D", "E", "F", or "G"

[^aeiouyAEIOUY] matches any non vowel
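A few spot checks of anchors and character classes, sketched in Python's `re` module as a stand-in for SAS. The helper `matches` and the test words are illustrative choices, not part of the original slides.

```python
import re

def matches(pattern, text):
    """True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

checks = {
    "class [acB] in 'Bob'": matches(r"[acB]", "Bob"),              # 'B' is in the class
    "range [D-G] in 'Hello'": matches(r"[D-G]", "Hello"),          # no D, E, F or G
    "negated class in 'aye'": matches(r"[^aeiouyAEIOUY]", "aye"),  # only vowels present
    "^[a-d] on 'boo'": matches(r"^[a-d]", "boo"),                  # starts with b
    "[a-d]$ on 'boo'": matches(r"[a-d]$", "boo"),                  # ends with o
}
```

The same class gives different answers with ^ and with $ because the anchor decides where in the string the class is tested.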

Search for words

data mywords;
  /* words starting with a-d */
  retain expr '/^[a-dA-D]/';
  retain prxNum;
  length word $ 50;
  input word $50.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    if prxNum=0 then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,word);
  put start= word= ;
  if start>0; /* keep only the matching words */
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;

find all the proper names

find words with a "q" not followed by a "u"

How about?

find all the proper names /[A-Z]/

find words with a "q" not followed by a "u" /q[^u]/
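Both suggested answers work on the word list, but each has an edge case worth seeing. The sketch below uses Python's `re` module as a stand-in for SAS; the words "Iraq", "quiz" and "eBay" are hypothetical additions to the slide's list, and the lookahead form q(?!u) is standard Perl syntax that the slides themselves do not use.

```python
import re

# Word list from the slides plus three hypothetical extras at the end.
words = ["a", "boo", "cwm", "Dublin", "oocyte",
         "pneumonoultramicroscopicsilicovolcanoconiosis",
         "qat", "Washington", "Iraq", "quiz", "eBay"]

def select(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

any_capital = select(r"[A-Z]")       # a capital letter anywhere in the word
leading_capital = select(r"^[A-Z]")  # a capital letter in first position only
q_class = select(r"q[^u]")           # q followed by some character other than u
q_lookahead = select(r"q(?!u)")      # q not followed by u, including q at the end
```

/[A-Z]/ also accepts "eBay", and /q[^u]/ misses "Iraq" because [^u] still has to consume a character after the q.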

Multipliers

{n} previous expression n times e.g. {3}

{n,} previous expression n or more times

{n,m} previous expression from n to m times

{0,m} previous expression m or fewer times

* previous expression 0 or more times {0,}

+ previous expression 1 or more times {1,}

? previous expression 0 or 1 times {0,1}
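The multipliers can be checked against whole strings. This sketch uses Python's `re.fullmatch` as a stand-in for anchoring a SAS pattern with ^ and $; the helper `whole` and the test strings are illustrative.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "{3} on 123": whole(r"\d{3}", "123"),        # exactly three digits
    "{3} on 12": whole(r"\d{3}", "12"),          # too few
    "{2,} on 12345": whole(r"\d{2,}", "12345"),  # two or more
    "{2,4} on 12345": whole(r"\d{2,4}", "12345"),  # five digits is too many
    "* on a": whole(r"ab*", "a"),                # zero b's is allowed
    "+ on a": whole(r"ab+", "a"),                # at least one b is required
    "? on abb": whole(r"ab?", "abb"),            # at most one b
}
```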

from the word list

find words without vowels /^[^aeiouyAEIOUY]+$/

"write only"? document your expressions

find words without vowels /^[^aeiouyAEIOUY]+$/

/*
^                    beginning of string
[^aeiouyAEIOUY]+     one or more non-vowels
$                    end of string
*/
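Some regex engines let the documentation live inside the expression itself. Perl does this with the /x modifier; the sketch below shows the equivalent `re.VERBOSE` flag in Python, used here as a stand-in. Whether a given SAS release accepts /x is not stated in the slides, so treat that as something to verify.

```python
import re

# The "no vowels" expression with its comments embedded in the pattern.
no_vowels = re.compile(r"""
    ^                    # beginning of string
    [^aeiouyAEIOUY]+     # one or more non-vowels
    $                    # end of string
""", re.VERBOSE)

words = ["a", "boo", "cwm", "Dublin", "oocyte", "qat", "Washington"]
vowel_free = [w for w in words if no_vowels.search(w)]
```

Only "cwm" survives, since y is counted as a vowel in this class.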

Hangman Example

• Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies

– e.g. did the person guess the most frequently used letters first?

– did the person guess vowels first?

Coding the strategies

data HangmanGuesses;
  %let ns=4;
  drop i prxNum1-prxNum&ns;
  array expr{&ns} $ 80 ex1-ex&ns
    ( '/^[aeiou]{3}/'
      '/^[etaoin]{6}/'
      '/^qwerty/'
      '/^[zqxjkv]{6}/' );
  array used{&ns} used1-used&ns;
  label used1='3 vowels first'
        used2='letter frequency'
        used3='qwerty'
        used4='unusuals';
  array prx{&ns} prxNum1-prxNum&ns;
  retain used1-used&ns;      /* strategy used */
  retain ex1-ex&ns;          /* strategy expression */
  retain prxNum1-prxNum&ns;  /* prx number */
  length guess $ 13;
  input guess $13. success;  /* guess occupies columns 1-13 */
  guess=lowcase(guess);
  if _n_=1 then do i=1 to &ns;
    prx{i}=prxParse(expr{i});
    /* report the index of the failing expression, not the total count */
    if prx{i}=0 then put 'expression ' i 'is bad ' expr{i}= ;
  end;
  do i=1 to &ns;
    used{i}=prxMatch(prx{i},guess);
  end;
datalines;
eaotwhnrbg    1
etaoinshrdlcu 0
etaoinshrdluc 0
qwertyuiopasd 0
vkjxqznmasdfg 0
asdfghjklzxcv 0
argbe         1
efghijklmnopq 0
abcdefghijklm 0
;
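To see which dummy variables the four strategy expressions should produce, here is a minimal sketch in Python's `re` module, used as a stand-in for the DATA step. The function name `code_guess` is hypothetical; the strategy labels and patterns are taken from the slide, and only a subset of the guesses is shown.

```python
import re

# Strategy label -> expression, as on the slide.
strategies = {
    "3 vowels first": r"^[aeiou]{3}",
    "letter frequency": r"^[etaoin]{6}",
    "qwerty": r"^qwerty",
    "unusuals": r"^[zqxjkv]{6}",
}

def code_guess(guess):
    """Return a 0/1 flag per strategy for one sequence of guessed letters."""
    text = guess.lower()
    return {name: int(re.search(pattern, text) is not None)
            for name, pattern in strategies.items()}

guesses = ["eaotwhnrbg", "etaoinshrdlcu", "qwertyuiopasd", "vkjxqznmasdfg", "argbe"]
coded = {g: code_guess(g) for g in guesses}
```

Because every expression starts with ^, a match can only begin at position 1, so the SAS variables used1-used4 hold 1 or 0 in the same way.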

We get dummy variables

Looking at expression 2

Memory within match

(pattern) treat the pattern as a unit and remember the part of the string matched

\n inside the match recall substring n

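example /(\d{3})X\1/ matches 123X123

not 123X456

(the multiplier goes inside the parentheses; with /(\d){3}X\1/ the buffer keeps only the last digit, so \1 would be "3")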
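Where the multiplier sits relative to the parentheses changes what \1 recalls. The sketch below, in Python's `re` module as a stand-in for Perl and SAS, compares (\d{3}) with (\d){3}; a repeated group keeps only its last repetition.

```python
import re

whole_group = re.compile(r"(\d{3})X\1")     # buffer 1 holds all three digits
repeated_group = re.compile(r"(\d){3}X\1")  # buffer 1 holds only the last digit

results = {
    "whole, 123X123": whole_group.search("123X123") is not None,
    "whole, 123X456": whole_group.search("123X456") is not None,
    "repeated, 123X123": repeated_group.search("123X123") is not None,
    "repeated, 123X3": repeated_group.search("123X3") is not None,
}
last_digit = repeated_group.search("123X3").group(1)
```

With the repeated group, \1 is "3", so the expression matches 123X3 rather than 123X123.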

Memory outside match

(pattern) treat the pattern as a unit and remember the part of the string matched

$n outside the match recall substring n

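example s/(\w+),(\w+)/$2 $1/ substitutes Doe,John

with John Doe

(the + goes inside the parentheses so that each buffer holds the whole word, not just its last character)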
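The same capture-buffer rule applies in a substitution. In this sketch Python's `re.sub` stands in for the Perl s/// form; note that Python writes the recalled buffers as \2 and \1 in the replacement, where Perl and SAS write $2 and $1.

```python
import re

name = "Doe,John"

# Each buffer captures a whole word.
swapped = re.sub(r"(\w+),(\w+)", r"\2 \1", name)

# With the + outside the parentheses each buffer keeps only the last
# character it matched.
last_chars = re.sub(r"(\w)+,(\w)+", r"\2 \1", name)
```

Here `swapped` is "John Doe", while `last_chars` is "n e".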

Call log example

datalines;

I called Fred at 9:17 am at 785-555-1234

10:12 Called George - (913)-555-3213

816-555-9876 was Irving the time was 1:22 pm

751 555 1212 8384 3:33 Bob

;

Get the time

retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
/* \d{1,2}:   one or two digits followed by a colon
   \d{2}\s?   two digits and optional space
   (pm|am)?   optional am or pm
*/

Get the phone number: define 3 capture buffers

retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
/* \(?          optional left paren
   ([2-9]\d\d)  3 digit area code (buffer 1)
   \)?          optional right paren
   [ -]         space or hyphen
   (\d\d\d)     3 digit exchange (buffer 2)
   [ -]         space or hyphen
   (\d{4})      last 4 digits (buffer 3)
*/

Use the expressions

retain prxTime prxPhone;
if _n_=1 then do;
  prxTime=prxParse(expTime);
  if prxTime=0 then put 'bad expression' expTime= ;
  prxPhone=prxParse(expPhone);
  if prxPhone=0 then put 'bad expression' expPhone= ;
end;
sequence=_n_;
call prxsubstr(prxTime, note, position, length);
time=substr(note,position,length);
call prxsubstr(prxPhone, note, position, length);
phone=substr(note,position,length);
call prxposn(prxPhone, 1, position, length);
ac=substr(note,position,length);
call prxposn(prxPhone, 2, position, length);
exchange=substr(note,position,length);
call prxposn(prxPhone, 3, position, length);
last4=substr(note,position,length);
local=exchange||'-'||last4;
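The two expressions can be tried against the four call log lines outside SAS. This is a minimal sketch in Python's `re` module; `parse_note` is a hypothetical helper, and stripping the trailing space from the time is an addition made here because \s? can match a space even when no am or pm follows.

```python
import re

TIME_RE = re.compile(r"\d{1,2}:\d{2}\s?(pm|am)?")
PHONE_RE = re.compile(r"\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})")

notes = [
    "I called Fred at 9:17 am at 785-555-1234",
    "10:12 Called George - (913)-555-3213",
    "816-555-9876 was Irving the time was 1:22 pm",
    "751 555 1212 8384 3:33 Bob",
]

def parse_note(note):
    """Extract the time, area code and standardized local number from one note."""
    time_match = TIME_RE.search(note)
    phone_match = PHONE_RE.search(note)
    if not (time_match and phone_match):
        return None
    area_code, exchange, last4 = phone_match.groups()  # capture buffers 1-3
    return {
        "time": time_match.group().strip(),
        "ac": area_code,
        "local": f"{exchange}-{last4}",
    }

parsed = [parse_note(n) for n in notes]
```

All four notes yield a local number in the form 555-1234, whatever separators the original text used.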

Result

The time and phone number have been extracted.

The phone number is standardized.

Substitution expressions

s/match expression/replacement/

s/cat/hat/ changes cat to hat

s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/

changes Doe-Roe,John to John Doe-Roe

Call PRXCHANGE (Data Step only)

CALL PRXCHANGE (regular-expression-id, times, old-string
    <, new-string <, result-length <, truncation-value <, number-of-changes>>>>);

PRXCHANGE (Data Step, SQL, where clauses)

PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)

data cc;
  length c $ 60 changedString $ 60;
  input c $60.;
  prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
  call prxChange(prxNum, 1, c, changedString, newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;

PRXCHANGE example

s/([a-zA-Z\-]+) first word

, comma

[ ]* zero or more blanks

([a-zA-Z\-]+) second word

/$2 $1/ switch words
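The substitution broken down above can be tried on the three data lines. In this sketch Python's `re.subn` stands in for CALL PRXCHANGE: it returns both the changed string and a count of changes, roughly like the changedString and numberChanges arguments. The function name `swap_names` is hypothetical, and \2 \1 is Python's spelling of $2 $1.

```python
import re

# first word, comma, zero or more blanks, second word
SWAP_RE = re.compile(r"([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)")

def swap_names(text):
    """Swap the two words around the comma; return (new text, number of changes)."""
    return SWAP_RE.subn(r"\2 \1", text, count=1)

rows = ["Doe-Roe,John", "BlackSheep, BaaBaa", "Prince"]
changed = [swap_names(r) for r in rows]
```

"Prince" has no comma, so it comes back unchanged with a change count of 0.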

PRXCHANGE example results