Perl regular expressions
This Powerpoint file can be found at:
http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10
Kansas City Area SAS User Group (KCASUG)
October 5, 2004
Larry Hoyle
Policy Research Institute, The University of Kansas
Regular expressions
• A regular expression is a pattern to be matched against some text (a string)
• Originally from neurophysiology
• Then in QED and grep
• see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp
Perl regular expressions
• Perl (Practical Extraction and Report Language) implements a version of regular expressions that has become something of a standard
• see: http://www.perldoc.com/perl5.6.1/pod/perlre.html
SAS Documentation
Short syntax description
Some simple examples
/Baa/ matches the string "Baa"
/Baa\d/ matches "Baa" followed by any numeric digit
Using Perl Regular Expressions in SAS 9.1 and above
data cc;
  input c $;
  prxNum=prxParse('/Baa\d/');
  start=prxMatch(prxNum,c);
  if start then put c= 'is a match';
  else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaaBaa3
;
run;

proc sql;
  select * from cc where prxmatch('/Baa\d/',c);
quit;
Documentation for PRX Functions and Call Routines in SAS HELP
CALL PRXCHANGE: Performs a pattern-matching replacement
CALL PRXDEBUG: Enables Perl regular expressions in a DATA step to send debug output to the SAS log
CALL PRXFREE: Frees unneeded memory that was allocated for a Perl regular expression
CALL PRXNEXT: Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
CALL PRXPOSN: Returns the start position and length for a capture buffer
CALL PRXSUBSTR: Returns the position and length of a substring that matches a pattern
PRXCHANGE Function: Performs a pattern-matching replacement
PRXMATCH Function: Searches for a pattern match and returns the position at which the pattern is found
PRXPAREN Function: Returns the last bracket match for which there is a match in a pattern
PRXPARSE Function: Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
PRXPOSN Function: Returns the value for a capture buffer
single character "wildcards"
. matches any character (except a newline)
\d matches a numeric character
\D matches a non-numeric character
\w matches a "word character" (letter, digit, or underscore)
\W matches a non-word character
\s matches white space (spaces, tabs, or line breaks)
\S matches non-white space
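The wildcards can be tried outside SAS as well. The following is a minimal sketch in Python, whose re module follows Perl syntax for these classes. The helper name position is hypothetical; it only mimics the PRXMATCH convention of returning a 1-based position, or 0 when nothing matches.

```python
import re

def position(pattern, text):
    """Return the 1-based position of the first match, or 0 if none (PRXMATCH-like)."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

# \d finds the first digit, \s the first white space, \W the first non-word character.
print(position(r"Baa\d", "aaaaBaa3"))  # match starts at the "B"
print(position(r"\d", "baa 3"))
print(position(r"\s", "baa 3"))
print(position(r"\W", "a_b"))          # underscore counts as a word character
```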
Try a different pattern for expr
data myturn;
  retain expr '/Whatever/'; /* put your own expression here */
  retain prxNum;
  length c $ 80;
  input c $80.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    /* a missing (or zero) id means the expression did not parse */
    if not prxNum then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,c);
  put start= c= ;
datalines;
Whatever floats your boat
Now is the time
for all-good
men 2
come to the aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;

find all the numbers
find the first space on each line
find any non word characters
sample expressions
find all the numbers   /\d/
find the first space on each line   /\s/
find any non word characters   /\W/
Anchors
^ beginning of the string
$ end of the string
Character Classes
[acB] matches "a", "c" or "B"
[D-G] matches "D", "E", "F", or "G"
[^aeiouyAEIOUY] matches any non vowel
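Anchors and character classes combine naturally. The sketch below uses Python's re module, which shares this syntax with Perl; the test strings are made up for illustration.

```python
import re

# A range inside brackets matches any single character in that range.
in_range = re.findall(r"[D-G]", "ABCDEFGH")

# A leading ^ inside brackets negates the class.
non_vowels = re.findall(r"[^aeiouyAEIOUY]", "Baa2")

# Outside brackets, ^ and $ anchor the match to the ends of the string.
starts_with_baa = bool(re.search(r"^Baa", "aaaaBaa3"))
ends_with_baa_digit = bool(re.search(r"Baa\d$", "aaaaBaa3"))

print(in_range, non_vowels, starts_with_baa, ends_with_baa_digit)
```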
Search for words
data mywords;
  /* words starting with a-d */
  retain expr '/^[a-dA-D]/';
  retain prxNum;
  length word $ 50;
  input word $50.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    /* a missing (or zero) id means the expression did not parse */
    if not prxNum then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,word);
  put start= word= ;
  if start>0;
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;
How about?
find all the proper names /[A-Z]/
find words with a "q" not followed by a "u" /q[^u]/
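A caveat on /q[^u]/: the class [^u] must consume a character, so a "q" at the very end of the text does not match. In a SAS character variable the trailing blank padding would normally supply that character, but in an unpadded string it does not. The sketch below, in Python's Perl-style re module, compares it with a negative lookahead, q(?!u), which is offered here as an alternative and is not from the original slides. The words "quick" and "Iraq" are added for illustration.

```python
import re

words = ["qat", "quick", "Iraq", "Washington", "Dublin", "cwm"]

# Any capital letter anywhere in the word.
proper = [w for w in words if re.search(r"[A-Z]", w)]

# "q" followed by some character other than "u".
q_class = [w for w in words if re.search(r"q[^u]", w)]

# "q" not followed by "u", including "q" as the last character.
q_lookahead = [w for w in words if re.search(r"q(?!u)", w)]

print(proper, q_class, q_lookahead)
```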
Multipliers
{n} previous expression n times e.g. {3}
{n,} previous expression n or more times
{n,m} previous expression from n to m times
{0,m} previous expression m or fewer times
* previous expression 0 or more times {0,}
+ previous expression 1 or more times {1,}
? previous expression 0 or 1 times {0,1}
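To see how the multipliers differ, the sketch below applies each one to the "a" in "Ba" and lists which candidate strings match in full. It uses Python's re module, which treats these quantifiers the same way Perl does; the candidate strings are made up.

```python
import re

candidates = ["B", "Ba", "Baa", "Baaa"]
quantifiers = ["{2}", "{2,}", "{1,2}", "{0,2}", "*", "+", "?"]

# For each quantifier, keep the candidates that the anchored pattern matches.
matches = {
    q: [c for c in candidates if re.search("^Ba" + q + "$", c)]
    for q in quantifiers
}

for q, hits in matches.items():
    print(q, hits)
```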
from the word list
find words without vowels /^[^aeiouyAEIOUY]+$/
"write only"? document your expressions
find words without vowels /^[^aeiouyAEIOUY]+$/
/*
^ beginning of string
[^aeiouyAEIOUY]+ one or more non-vowels
$ end of string
*/
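Some regex engines let the documentation live inside the expression itself. Perl has the /x modifier for this; the sketch below shows the equivalent re.VERBOSE flag in Python. Whether a given SAS release accepts an equivalent option is not covered by these slides, so treat this as an illustration of the idea only.

```python
import re

no_vowels = re.compile(
    r"""
    ^                    # beginning of string
    [^aeiouyAEIOUY]+     # one or more non-vowels
    $                    # end of string
    """,
    re.VERBOSE,
)

words = ["a", "boo", "cwm", "Dublin", "oocyte", "qat", "Washington"]
vowel_free = [w for w in words if no_vowels.search(w)]
print(vowel_free)
```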
Hangman Example
• Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies
– e.g. did the person guess the most frequently used letters first?
– did the person guess vowels first?
Coding the strategies
%let ns=4;
data HangmanGuesses;
  drop i prxNum1-prxNum&ns;
  array expr{&ns} $ 80 ex1-ex&ns (
    '/^[aeiou]{3}/'
    '/^[etaoin]{6}/'
    '/^qwerty/'
    '/^[zqxjkv]{6}/'
  );
  array used{&ns} used1-used&ns;
  label used1='3 vowels first'
        used2='letter frequency'
        used3='qwerty'
        used4='unusuals';
  array prx{&ns} prxNum1-prxNum&ns;
  retain used1-used&ns;     /* strategy used */
  retain ex1-ex&ns;         /* strategy expression */
  retain prxNum1-prxNum&ns; /* prx number */
  length guess $ 13;
  /* guess occupies columns 1-13; success follows */
  input guess $13. success;
  guess=lowcase(guess);
  if _n_=1 then do i=1 to &ns;
    prx{i}=prxParse(expr{i});
    /* report the index of the failing expression, not the count &ns */
    if not prx{i} then put 'expression ' i 'is bad ' expr{i}= ;
  end;
  do i=1 to &ns;
    used{i}=prxMatch(prx{i},guess);
  end;
datalines;
eaotwhnrbg    1
etaoinshrdlcu 0
etaoinshrdluc 0
qwertyuiopasd 0
vkjxqznmasdfg 0
asdfghjklzxcv 0
argbe         1
efghijklmnopq 0
abcdefghijklm 0
;
run;
We get dummy variables
Looking at expression 2
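Because every strategy pattern is anchored with ^, PRXMATCH can only return 1 or 0 here, which is what makes the used variables behave as dummies. The sketch below recomputes them in Python (Perl-style re module) from the same four patterns and a few of the guesses; it is a cross-check under those assumptions, not the output shown on the original slide.

```python
import re

# The four strategy patterns, in the same order as the SAS array.
strategies = {
    "3 vowels first": r"^[aeiou]{3}",
    "letter frequency": r"^[etaoin]{6}",
    "qwerty": r"^qwerty",
    "unusuals": r"^[zqxjkv]{6}",
}

guesses = ["eaotwhnrbg", "etaoinshrdlcu", "qwertyuiopasd",
           "vkjxqznmasdfg", "asdfghjklzxcv", "argbe"]

def dummies(guess):
    """Return one 0/1 flag per strategy for a single guess sequence."""
    return [1 if re.search(p, guess.lower()) else 0 for p in strategies.values()]

table = {g: dummies(g) for g in guesses}
for g, flags in table.items():
    print(g.ljust(13), flags)
```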
Memory within match
(pattern) treat the pattern as a unit and remember the part of the string matched
\n inside the match recall substring n
example /(\d{3})X\1/ matches 123X123 but not 123X456
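Where the parentheses sit relative to the multiplier matters. With (\d{3}) the buffer holds all three digits; with (\d){3} the group is repeated and the buffer keeps only the last digit. A small sketch in Python's Perl-style re module, using made-up test strings:

```python
import re

whole_group = re.compile(r"(\d{3})X\1")   # buffer 1 = three digits
last_digit = re.compile(r"(\d){3}X\1")    # buffer 1 = only the third digit

print(bool(whole_group.search("123X123")))  # same three digits repeated
print(bool(whole_group.search("123X456")))
print(bool(last_digit.search("123X123")))   # \1 is "3", but "1" follows the X
print(bool(last_digit.search("123X3")))
```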
Memory outside match
(pattern) treat the pattern as a unit and remember the part of the string matched
$n outside the match recall substring n
example s/(\w+),(\w+)/$2 $1/ changes Doe,John to John Doe
Call log example
datalines;
I called Fred at 9:17 am at 785-555-1234
10:12 Called George - (913)-555-3213
816-555-9876 was Irving the time was 1:22 pm
751 555 1212 8384 3:33 Bob
;
Get the time
retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
/* \d{1,2}: one or two digits followed by a colon
\d{2}\s? two digits and optional space
(pm|am)? optional am or pm
*/
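As a quick check of the time expression against the four call-log lines, here is a sketch in Python's Perl-style re module. Because \s? can consume a blank even when no am or pm follows, the matched text is stripped before use; the helper name extract_time is hypothetical.

```python
import re

time_re = re.compile(r"\d{1,2}:\d{2}\s?(pm|am)?")

notes = [
    "I called Fred at 9:17 am at 785-555-1234",
    "10:12 Called George - (913)-555-3213",
    "816-555-9876 was Irving the time was 1:22 pm",
    "751 555 1212 8384 3:33 Bob",
]

def extract_time(note):
    """Return the first time found in the note, or None."""
    m = time_re.search(note)
    return m.group(0).strip() if m else None

times = [extract_time(n) for n in notes]
print(times)
```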
Get the phone number
define 3 capture buffers
  retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
  /*
  \(?          optional left paren
  ([2-9]\d\d)  3 digit area code (buffer 1)
  \)?          optional right paren
  [ -]         space or hyphen
  (\d\d\d)     3 digit exchange (buffer 2)
  [ -]         space or hyphen
  (\d{4})      4 digit line number (buffer 3)
  */
Use the expressions
  retain prxTime prxPhone;
  if _n_=1 then do;
    prxTime=prxParse(expTime);
    /* a missing (or zero) id means the expression did not parse */
    if not prxTime then put 'bad expression' expTime= ;
    prxPhone=prxParse(expPhone);
    if not prxPhone then put 'bad expression' expPhone= ;
  end;
  sequence=_n_;
  call prxsubstr(prxTime, note, position, length);
  time=substr(note,position,length);
  call prxsubstr(prxPhone, note, position, length);
  phone=substr(note,position,length);
  call prxposn(prxPhone, 1, position, length);
  ac=substr(note,position,length);
  call prxposn(prxPhone, 2, position, length);
  exchange=substr(note,position,length);
  call prxposn(prxPhone, 3, position, length);
  last4=substr(note,position,length);
  /* catx drops blank padding before joining with the hyphen */
  local=catx('-',exchange,last4);
Result
The time and phone number have been extracted. The phone number is standardized.
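The three capture buffers can be checked against the same call-log lines with a short sketch in Python's Perl-style re module. The function name split_phone and the tuple layout are hypothetical; the pattern is the phone expression from the slides.

```python
import re

phone_re = re.compile(r"\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})")

notes = [
    "I called Fred at 9:17 am at 785-555-1234",
    "10:12 Called George - (913)-555-3213",
    "816-555-9876 was Irving the time was 1:22 pm",
    "751 555 1212 8384 3:33 Bob",
]

def split_phone(note):
    """Return (area code, standardized local number) or None if no phone is found."""
    m = phone_re.search(note)
    if m is None:
        return None
    ac, exchange, last4 = m.groups()   # buffers 1, 2 and 3
    return ac, exchange + "-" + last4

phones = [split_phone(n) for n in notes]
print(phones)
```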
Substitution expressions
s/match expression/replacement/
s/cat/hat/ changes cat to hat
s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/
changes Doe-Roe,John to John Doe-Roe
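The same two substitutions can be sketched in Python, with one notational difference: Python's re.sub takes the match expression and the replacement as separate arguments and writes the remembered substrings as \1 and \2 rather than $1 and $2. The sample sentence is made up.

```python
import re

# s/cat/hat/ : replace the first occurrence only (count=1).
hat = re.sub(r"cat", "hat", "the cat sat", count=1)

# s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/ : swap the two captured words.
swapped = re.sub(r"([a-zA-Z\-]+),([a-zA-Z\-]+)", r"\2 \1", "Doe-Roe,John", count=1)

print(hat)
print(swapped)
```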
CALL PRXCHANGE (Data Step only)
CALL PRXCHANGE (regular-expression-id,
times,
old-string
<, new-string
<, result-length
<, truncation-value
<, number-of-changes>>>>);
PRXCHANGE (Data Step, SQL, where clauses)
PRXCHANGE(perl-regular-expression |
regular-expression-id,
times,
source)
data cc;
  length c $ 60 changedString $ 60;
  input c $60.;
  prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
  CALL prxChange(prxNum, 1, c, changedString, newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;
PRXCHANGE example
s/([a-zA-Z\-]+) first word
, comma
[ ]* zero or more blanks
([a-zA-Z\-]+) second word
/$2 $1/ switch words
PRXCHANGE example results
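For readers without the original results slide, the substitution can be traced with a sketch in Python's Perl-style re module. re.subn returns both the changed string and the number of changes, loosely mirroring the changedString and numberChanges arguments; this is a recomputation from the pattern and the three data lines, not the SAS output itself.

```python
import re

swap = re.compile(r"([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)")

lines = ["Doe-Roe,John", "BlackSheep, BaaBaa", "Prince"]

# count=1 corresponds to a "times" argument of 1 in CALL PRXCHANGE.
results = [swap.subn(r"\2 \1", text, count=1) for text in lines]

for original, (changed, n) in zip(lines, results):
    print(original, "->", changed, "| changes:", n)
```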