Page 1:

Perl regular expressions

This PowerPoint file can be found at:

http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10

Kansas City Area SAS User Group (KCASUG)

October 5, 2004

Larry Hoyle

Policy Research Institute, The University of Kansas

Page 2:

Regular expressions

• A regular expression is a pattern to be matched against some text (a string)

• Originally from neurophysiology

• Then in QED and grep

• See: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp

Page 3:

Perl regular expressions

• Perl (Practical Extraction and Report Language) implements a version of regular expressions that has become something of a standard

• see: http://www.perldoc.com/perl5.6.1/pod/perlre.html

Page 4:

SAS Documentation

Page 5:

Short syntax description

Page 6:

Some simple examples

/Baa/ matches the string "Baa"

/Baa\d/ matches "Baa" followed by any numeric digit

Page 7:

Using Perl Regular Expressions in SAS 9.1 and above

data cc;
   input c $;
   prxNum=prxParse('/Baa\d/');
   start=prxMatch(prxNum,c);
   if start then put c= 'is a match';
   else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaaBaa3
;
run;

proc sql;
   select * from cc where prxmatch('/Baa\d/',c);
quit;

Page 8:

Documentation for PRX Functions and Call Routines in SAS HELP

CALL PRXCHANGE: Performs a pattern-matching replacement

CALL PRXDEBUG: Enables Perl regular expressions in a DATA step to send debug output to the SAS log

CALL PRXFREE: Frees unneeded memory that was allocated for a Perl regular expression

CALL PRXNEXT: Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string

CALL PRXPOSN: Returns the start position and length for a capture buffer

CALL PRXSUBSTR: Returns the position and length of a substring that matches a pattern

PRXCHANGE Function: Performs a pattern-matching replacement

PRXMATCH Function: Searches for a pattern match and returns the position at which the pattern is found

PRXPAREN Function: Returns the last bracket match for which there is a match in a pattern

PRXPARSE Function: Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value

PRXPOSN Function: Returns the value for a capture buffer

Page 9:

single character "wildcards"

. matches any character (except a newline)

\d matches a numeric character (a digit)

\D matches a non-numeric character

\w matches a "word character" (letter, digit, or underscore)

\W matches a non-word character

\s matches white space (spaces, tabs, and other whitespace such as newlines)

\S matches non-white space

Page 10:

Try a different pattern for expr

data myturn;
   retain expr '/Whatever/';   /* put your own expression here */
   retain prxNum;
   length c $ 80;
   input c $80.;
   if _n_=1 then do;
      prxNum=prxParse(expr);
      /* prxParse returns a missing value when the expression cannot be compiled */
      if missing(prxNum) then put 'bad expression ' expr= ;
   end;
   start=prxMatch(prxNum,c);
   put start= c= ;
datalines;
Whatever floats your boat
Now is the time
for all-good
men 2
come to the aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;

find all the numbers
find the first space on each line
find any non word characters

Page 11:

sample expressions

find all the numbers /\d/
find the first space on each line /\s/
find any non word characters /\W/
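A single prxMatch call reports only the first match in a string, so /\d/ locates the first digit on a line rather than every number. One way to visit all of the matches is CALL PRXNEXT, which appears in the list of PRX call routines. The following is a minimal sketch and not part of the original slides; the sample sentence and the variable names found, start, and stop are made up for illustration.

```sas
data _null_;
   length c $ 80;
   /* hypothetical sample text containing more than one number */
   c = 'The quick red fox jumped over the 3 lazy dogs 12 times';
   prxNum = prxParse('/\d+/');     /* one or more digits */
   start = 1;                      /* where the next search begins */
   stop  = length(c);              /* last character to search */
   call prxNext(prxNum, start, stop, c, position, length);
   do while (position > 0);        /* position is 0 once no match is left */
      found = substr(c, position, length);
      put found= position= length=;
      call prxNext(prxNum, start, stop, c, position, length);
   end;
run;
```

CALL PRXNEXT updates start after each match, so the loop simply repeats the same call until position comes back as 0.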

Page 12:

Anchors

^ beginning of the string

$ end of the string
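One thing to watch for in SAS: a character variable is padded with trailing blanks to its declared length, and those blanks are part of the string that the PRX functions see. An expression that ends in $ can therefore fail on a padded variable even though the visible text fits. The sketch below is an illustration under that assumption and not part of the original slides; the variable names padded and trimmed are made up.

```sas
data _null_;
   length word $ 10;               /* stored value is 'cwm' plus 7 blanks */
   word = 'cwm';
   prxNum = prxParse('/^cwm$/');
   /* trailing blanks sit between the "m" and the end of the string */
   padded  = prxMatch(prxNum, word);
   /* trim() removes the trailing blanks before matching */
   trimmed = prxMatch(prxNum, trim(word));
   put padded= trimmed=;
run;
```

The first call should report no match (0) and the second a match at position 1. Writing the pattern as /^cwm\s*$/ is another way to allow for the padding.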

Page 13:

Character Classes

[acB] matches "a", "c" or "B"

[D-G] matches "D", "E", "F", or "G"

[^aeiouyAEIOUY] matches any non vowel

Page 14:

Search for words

data mywords;
   /* words starting with a-d */
   retain expr '/^[a-dA-D]/';
   retain prxNum;
   length word $ 50;
   input word $50.;
   if _n_=1 then do;
      prxNum=prxParse(expr);
      /* prxParse returns a missing value when the expression cannot be compiled */
      if missing(prxNum) then put 'bad expression ' expr= ;
   end;
   start=prxMatch(prxNum,word);
   put start= word= ;
   if start>0;   /* keep only the matching words */
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;

find all the proper names
find words with a "q" not followed by a "u"

Page 15:

How about?

find all the proper names

find words with a "q" not followed by a "u"

Page 16:

How about?

find all the proper names /[A-Z]/

find words with a "q" not followed by a "u"

Page 17:

How about?

find all the proper names /[A-Z]/

find words with a "q" not followed by a "u" /q[^u]/
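The class [^u] has to consume a character, so /q[^u]/ cannot match a "q" that is the very last character of the text. (When the word is held in a blank-padded SAS variable, a trailing blank can play that role.) A negative look-ahead, (?!u), only checks what follows without consuming anything. The sketch below compares the two on trimmed values. It is not from the original slides: the word "Iraq" and the variable names are made up for illustration, and it assumes the look-ahead syntax that SAS documents among its PRX metacharacters.

```sas
data _null_;
   length word $ 12;
   classExpr = prxParse('/q[^u]/');    /* q followed by some character other than u */
   aheadExpr = prxParse('/q(?!u)/');   /* q not followed by u, including q at the very end */
   do word = 'qat', 'queen', 'Iraq';
      /* trim() drops the padding so a final q really is the last character */
      byClass = prxMatch(classExpr, trim(word));
      byAhead = prxMatch(aheadExpr, trim(word));
      put word= byClass= byAhead=;
   end;
run;
```

Both expressions agree on "qat" and "queen"; only the look-ahead version should find the final "q" in the trimmed "Iraq".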

Page 18:

Multipliers

{n} previous expression n times e.g. {3}

{n,} previous expression n or more times

{n,m} previous expression from n to m times

{0,m} previous expression m or fewer times

* previous expression 0 or more times, same as {0,}

+ previous expression 1 or more times, same as {1,}

? previous expression 0 or 1 times, same as {0,1}
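As a small illustration of the multipliers, the sketch below checks values against a "three digits, hyphen, four digits" layout. It is not part of the original slides; the test values and variable names are made up, and the \s* before $ is there on the assumption that the variable carries trailing blanks.

```sas
data _null_;
   length c $ 20;
   /* \d{3}  exactly three digits
      -      a hyphen
      \d{4}  exactly four digits
      \s*    zero or more trailing blanks, then end of string */
   prxNum = prxParse('/^\d{3}-\d{4}\s*$/');
   do c = '555-1234', '55-1234', '555-12345';
      start = prxMatch(prxNum, c);
      put c= start=;
   end;
run;
```

Only the first value fits the layout; the second has too few digits before the hyphen and the third has too many after it.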

Page 19:

from the word list

find words without vowels

Page 20:

from the word list

find words without vowels /^[^aeiouyAEIOUY]+$/

Page 21:

"write only"? document your expressions

find words without vowels /^[^aeiouyAEIOUY]+$/

/*
   ^                   beginning of string
   [^aeiouyAEIOUY]+    one or more non-vowels
   $                   end of string
*/

Page 22:

Hangman Example

• Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies
  – e.g. did the person guess the most frequently used letters first?
  – did the person guess vowels first?

Page 23:

Coding the strategies

%let ns=4;

data HangmanGuesses;
   drop i prxNum1-prxNum&ns;
   array expr{&ns} $ 80 ex1-ex&ns (
      '/^[aeiou]{3}/'
      '/^[etaoin]{6}/'
      '/^qwerty/'
      '/^[zqxjkv]{6}/'
   );
   array used{&ns} used1-used&ns;
   label used1='3 vowels first'
         used2='letter frequency'
         used3='qwerty'
         used4='unusuals';
   array prx{&ns} prxNum1-prxNum&ns;
   retain used1-used&ns;       /* strategy name */
   retain ex1-ex&ns;           /* strategy expression */
   retain prxNum1-prxNum&ns;   /* prx number */

   length guess $ 13;
   input guess $13. success;   /* guess occupies columns 1-13 */
   guess=lowcase(guess);

   if _n_=1 then do i=1 to &ns;
      prx{i}=prxParse(expr{i});
      /* report the index of the expression that failed to compile */
      if missing(prx{i}) then put 'expression ' i 'is bad ' expr{i}= ;
   end;
   do i=1 to &ns;
      used{i}=prxMatch(prx{i},guess);
   end;
datalines;
eaotwhnrbg    1
etaoinshrdlcu 0
etaoinshrdluc 0
qwertyuiopasd 0
vkjxqznmasdfg 0
asdfghjklzxcv 0
argbe         1
efghijklmnopq 0
abcdefghijklm 0
;
run;

Page 24:

We get dummy variables

Page 25:

Looking at expression 2

Page 26:

Memory within match

(pattern) treat the pattern as a unit and remember the part of the string matched

\n inside the match recall substring n

example /(\d{3})X\1/ matches 123X123 but not 123X456
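Where the multiplier sits relative to the parentheses matters. With (\d{3}) the buffer holds all three digits, while with (\d){3} the group is repeated three times and the buffer keeps only the last repetition. The sketch below compares the two forms; it is an illustration rather than part of the original slides, and the variable names and the third test value are made up.

```sas
data _null_;
   length c $ 10;
   inside  = prxParse('/(\d{3})X\1/');   /* buffer 1 holds all three digits */
   outside = prxParse('/(\d){3}X\1/');   /* buffer 1 holds only the last digit */
   do c = '123X123', '123X456', '123X3';
      mInside  = prxMatch(inside, c);
      mOutside = prxMatch(outside, c);
      put c= mInside= mOutside=;
   end;
run;
```

The first form should match only 123X123, and the second only 123X3.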

Page 27:

Memory outside match

(pattern) treat the pattern as a unit and remember the part of the string matched

$n outside the match recall substring n

example s/(\w+),(\w+)/$2 $1/ replaces Doe,John with John Doe

Page 28:

Call log example

datalines;

I called Fred at 9:17 am at 785-555-1234

10:12 Called George - (913)-555-3213

816-555-9876 was Irving the time was 1:22 pm

751 555 1212 8384 3:33 Bob

;

Page 29:

Get the time

retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';

/*
   \d{1,2}:    one or two digits followed by a colon
   \d{2}\s?    two digits and optional space
   (pm|am)?    optional am or pm
*/

Page 30:

Get the phone number: define 3 capture buffers

retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';

/*
   \(?           optional left paren
   ([2-9]\d\d)   3 digit area code (buffer 1)
   \)?           optional right paren
   [ -]          space or hyphen
   (\d\d\d)      3 digit exchange (buffer 2)
   [ -]          space or hyphen
   (\d{4})       last 4 digits (buffer 3)
*/

Page 31:

Use the expressions

retain prxTime prxPhone;

if _n_=1 then do;
   prxTime=prxParse(expTime);
   if missing(prxTime) then put 'bad expression ' expTime= ;
   prxPhone=prxParse(expPhone);
   if missing(prxPhone) then put 'bad expression ' expPhone= ;
end;

sequence=_n_;

/* whole match for the time */
call prxsubstr(prxTime, note, position, length);
time=substr(note,position,length);

/* whole match for the phone number */
call prxsubstr(prxPhone, note, position, length);
phone=substr(note,position,length);

/* capture buffers 1-3 from the phone match */
call prxposn(prxPhone, 1, position, length);
ac=substr(note,position,length);
call prxposn(prxPhone, 2, position, length);
exchange=substr(note,position,length);
call prxposn(prxPhone, 3, position, length);
last4=substr(note,position,length);

/* catx trims each part before joining them with a hyphen */
local=catx('-',exchange,last4);
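The PRXPOSN function, listed among the PRX functions as returning the value for a capture buffer, can shorten this step because it returns the buffer text directly instead of a position and length. A minimal sketch, using the phone expression and one line from the call log; the DATA step wrapper and the declared lengths are assumptions made for the illustration.

```sas
data _null_;
   length note $ 60 ac exchange $ 3 last4 $ 4;
   note = 'I called Fred at 9:17 am at 785-555-1234';
   prxPhone = prxParse('/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/');
   /* the buffers are only meaningful after a successful match */
   if prxMatch(prxPhone, note) then do;
      ac       = prxPosn(prxPhone, 1, note);   /* area code (buffer 1) */
      exchange = prxPosn(prxPhone, 2, note);   /* exchange (buffer 2) */
      last4    = prxPosn(prxPhone, 3, note);   /* last 4 digits (buffer 3) */
      put ac= exchange= last4=;
   end;
run;
```

Guarding the calls with the prxMatch test also avoids reading buffers for a line that contains no phone number.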

Page 32:

Result

The time and phone number have been extracted.

The phone number is standardized.

Page 33:

Substitution expressions

s/match expression/replacement/

s/cat/hat/ changes cat to hat

s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/

changes Doe-Roe,John to John Doe-Roe

Page 34:

CALL PRXCHANGE (DATA step only)

CALL PRXCHANGE (regular-expression-id, times, old-string
                <, new-string <, result-length
                <, truncation-value <, number-of-changes>>>>);

Page 35:

PRXCHANGE function (DATA step, SQL, WHERE clauses)

PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)
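Because the function form accepts the expression itself as its first argument, it can be used where there is no DATA step to hold a compiled expression id, for example in PROC SQL. The following is a sketch only and not part of the original slides; the data set name names and the column alias changedString are made up for illustration.

```sas
data names;
   length c $ 60;
   input c $60.;
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
;
run;

proc sql;
   /* swap "last,first" into "first last"; 1 = change the first match only */
   select c,
          prxchange('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/', 1, c)
             as changedString
   from names;
quit;
```

A value with no match for the expression is returned unchanged.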

Page 36:

PRXCHANGE example

data cc;
   length c $ 60 changedString $ 60;
   input c $60.;
   prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
   call prxChange(prxNum, 1, c, changedString, newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;

s/([a-zA-Z\-]+)    first word
,                  comma
[ ]*               zero or more blanks
([a-zA-Z\-]+)      second word
/$2 $1/            switch words

Page 37:

PRXCHANGE example results