Data Cleaning 101 Ron Cody, Ed.D Robert Wood Johnson Medical School Piscataway, NJ.
Data Cleaning 101
Ron Cody, Ed.D.
Robert Wood Johnson Medical School
Piscataway, NJ
Sample Data Set

    Variable Name  Description            Type       Valid Values
    PATNO          Patient Number         Character  Numerals
    GENDER         Gender                 Character  'M' or 'F'
    VISIT          Visit Date             MMDDYY10.  Any valid date
    HR             Heart Rate             Numeric    40 to 100
    SBP            Systolic Blood Pres.   Numeric    80 to 200
    DBP            Diastolic Blood Pres.  Numeric    60 to 120
    DX             Diagnosis Code         Character  1 to 3 digits
    AE             Adverse Event          Character  '0' or '1'
Using PROC FREQ to Look for Invalid Character Values

    PROC FREQ DATA=PATIENTS;
       TITLE "Frequency Counts";
       TABLES GENDER DX AE / NOCUM NOPERCENT;
    RUN;

    The FREQ Procedure

    Gender

    GENDER    Frequency
    -------------------
    2                 1
    F                12
    M                13
    X                 1
    f                 2

    Frequency Missing = 1
Using PROC PRINT with a WHERE Statement

    PROC PRINT DATA=PATIENTS;
       WHERE GENDER NOT IN ('F','M',' ') OR
             VERIFY(DX,' 0123456789') NE 0 OR
             AE NOT IN ('0','1',' ');
       TITLE "Listing of Invalid Data";
       ID PATNO;
       VAR GENDER DX AE;
    RUN;
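The DX test above relies on VERIFY, which returns the position of the first character that is not in the supplied list, or 0 when every character is in the list. A minimal sketch of that behavior follows; the sample DX values are illustrative and are not read from the PATIENTS data set.

```sas
/* Sketch: how VERIFY flags characters outside an allowed list.     */
/* The DX values below are made-up examples, not the PATIENTS data. */
DATA _NULL_;
   LENGTH DX $3;
   DO DX = '12', '1.3', 'X', ' ';
      /* 0 means every character is a blank or a digit */
      BAD_POS = VERIFY(DX, ' 0123456789');
      PUT DX= BAD_POS=;
   END;
RUN;
/* BAD_POS is 0 for '12', 2 for '1.3', 1 for 'X', and 0 for a blank DX */
```

Including the blank in the list matters because DX is stored with trailing blanks; without it, any value shorter than three characters would be flagged.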
    Listing of Invalid Data

    PATNO  GENDER  DX  AE
    002  F  X  0
    003  X  3  1
    XX5  M  1  A
    010  f  1  0
    013  2  1
    002  F  X  0
    023  f  0
    987  M  1.3  0
Using a Data Step to Identify Invalid Character Values

    DATA _NULL_;
       INFILE "C:PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the output window;
       TITLE "Listing of Invalid Data";
       INPUT @1  PATNO  $3.
             @4  GENDER $1.
             @24 DX     $3.
             @27 AE     $1.;
       ***Check GENDER;
       IF GENDER NOT IN ('F','M',' ') THEN PUT PATNO= GENDER=;
       ***Check DX;
       IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
       ***Check AE;
       IF AE NOT IN ('0','1',' ') THEN PUT PATNO= AE=;
    RUN;

    Listing of Invalid Data
    PATNO=002 DX=X
    PATNO=003 GENDER=X
    PATNO=XX5 AE=A
    PATNO=010 GENDER=f
    PATNO=013 GENDER=2
    PATNO=002 DX=X
    PATNO=023 GENDER=f
    PATNO=987 DX=1.3
Using PROC MEANS to Look for Outliers

    PROC MEANS DATA=CLEAN.PATIENTS N NMISS MIN MAX MAXDEC=3;
       TITLE "Checking Numeric Variables";
       VAR HR SBP DBP;
    RUN;

    Checking Numeric Variables

    Variable  Label                       N  Nmiss   Minimum   Maximum
    ------------------------------------------------------------------
    HR        Heart Rate                 27      3    10.000   900.000
    SBP       Systolic Blood Pressure    26      4    20.000   400.000
    DBP       Diastolic Blood Pressure   27      3     8.000   200.000
    ------------------------------------------------------------------
Using PROC UNIVARIATE with an ODS Select Statement

    ODS SELECT EXTREMEOBS;
    PROC UNIVARIATE DATA=CLEAN.PATIENTS;
       VAR HR SBP DBP;
       ID PATNO;
    RUN;

    The UNIVARIATE Procedure
    Variable: DBP (Diastolic Blood Pressure)

    Extreme Observations

    --------Lowest--------    --------Highest-------
    Value   PATNO    Obs      Value   PATNO    Obs
        8   020       23        106   027       28
       20   011       12        120   004        4
       64   013       14        120   010       11
       68   025       27        180   009       10
       68   006        6        200   321       22
Using the NEXTROBS Option with PROC UNIVARIATE

    ODS SELECT EXTREMEOBS;
    PROC UNIVARIATE DATA=CLEAN.PATIENTS NEXTROBS=3;
       VAR HR SBP DBP;
       ID PATNO;
    RUN;

    Variable: DBP (Diastolic Blood Pressure)

    Extreme Observations

    --------Lowest--------    --------Highest-------
    Value   PATNO    Obs      Value   PATNO    Obs
        8   020       23        120   010       11
       20   011       12        180   009       10
       64   013       14        200   321       22
Using a WHERE statement with PROC PRINT to list out-of-range data

    PROC PRINT DATA=CLEAN.PATIENTS;
       WHERE HR NOT BETWEEN 40 AND 100 AND HR IS NOT MISSING OR
             SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT MISSING OR
             DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT MISSING;
       TITLE "Out-of-range Values for Numeric Variables";
       ID PATNO;
       VAR HR SBP DBP;
    RUN;
    Out-of-range Values for Numeric Variables

    PATNO     HR    SBP    DBP
    004      101    200    120
    008      210      .      .
    009       86    240    180
    010        .     40    120
    011       68    300     20
    014       22    130     90
    017      208      .     84
    321      900    400    200
    020       10     20      8
    023       22     34     78
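The `IS NOT MISSING` conditions in the WHERE statement are there because SAS orders a numeric missing value below every number, so a missing HR would otherwise count as "below 40". A minimal sketch of that behavior, using a single made-up value rather than the PATIENTS data:

```sas
/* Sketch: a numeric missing value compares as lower than any number. */
/* HR is set by hand here; nothing is read from CLEAN.PATIENTS.       */
DATA _NULL_;
   HR = .;
   /* Unguarded test: true for a missing value */
   IF HR LT 40 THEN PUT "Unguarded check flags the missing HR";
   /* Guarded test: the same logic used in the range checks */
   IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN
      PUT "Guarded check flags the missing HR";
   ELSE PUT "Guarded check ignores the missing HR";
RUN;
```

The first and third messages are written to the log. Whether missing values should be reported separately (for example with NMISS in PROC MEANS) is a design choice; the guard simply keeps them out of the out-of-range listing.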
Using a DATA _NULL_ Data Step to list out-of-range data values

    DATA _NULL_;
       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
       FILE PRINT; ***output to the output Window;
       TITLE "Listing of Patient Numbers and Invalid Data Values";
       INPUT @1  PATNO $3.
             @15 HR    3.
             @18 SBP   3.
             @21 DBP   3.;
       ***Check HR;
       IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN PUT PATNO= HR=;
       ***Check SBP;
       IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 THEN PUT PATNO= SBP=;
       ***Check DBP;
       IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 THEN PUT PATNO= DBP=;
    RUN;
    Listing of Patient Numbers and Invalid Data Values
    PATNO=004 HR=101
    PATNO=008 HR=210
    PATNO=009 SBP=240
    PATNO=009 DBP=180
    PATNO=010 SBP=40
    PATNO=011 SBP=300
    PATNO=011 DBP=20
    PATNO=014 HR=22
    PATNO=017 HR=208
    PATNO=321 HR=900
    PATNO=321 SBP=400
    PATNO=321 DBP=200
    PATNO=020 HR=10
    PATNO=020 SBP=20
    PATNO=020 DBP=8
    PATNO=023 HR=22
    PATNO=023 SBP=34
Using User Defined Formats to Detect Invalid Values

    PROC FORMAT;
       VALUE $GENDER 'F','M' = 'Valid'
                     ' '     = 'Missing'
                     OTHER   = 'Miscoded';
       VALUE $DX '001' - '999' = 'Valid'
                 ' '           = 'Missing'
                 OTHER         = 'Miscoded';
       VALUE $AE '0','1' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
    RUN;

    PROC FREQ DATA=CLEAN.PATIENTS;
       TITLE "Using FORMATS";
       FORMAT GENDER $GENDER. DX $DX. AE $AE.;
       TABLES GENDER DX AE / NOCUM NOPERCENT;
    RUN;

    Gender

    GENDER      Frequency
    ---------------------
    Miscoded            4
    Valid              25

    Frequency Missing = 1
Using User-defined Formats and a PUT Function

    DATA _NULL_;
       INFILE "C:PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the output window;
       TITLE "Invalid Data Values";
       INPUT @1  PATNO  $3.
             @4  GENDER $1.
             @24 DX     $3.
             @27 AE     $1.;
       IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
       IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
       IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
    RUN;

    Invalid Data Values
    PATNO=002 DX=X
    PATNO=003 GENDER=X
    PATNO=004 AE=A
    PATNO=010 GENDER=f
    PATNO=013 GENDER=2
    PATNO=002 DX=X
    PATNO=023 GENDER=f
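Note that this listing does not flag patient 987's DX of 1.3, which the VERIFY-based check did flag. A likely explanation is that a character range such as '001' - '999' is compared character by character in collating order, so '1.3' falls inside the range. The sketch below illustrates this with a format named $DXCHK, a hypothetical name chosen so that the $DX format is not overwritten; the test values are illustrative.

```sas
/* Sketch: a character range in a format is a string comparison,  */
/* not a numeric one. $DXCHK is a hypothetical copy of $DX.       */
PROC FORMAT;
   VALUE $DXCHK '001' - '999' = 'Valid'
                ' '           = 'Missing'
                OTHER         = 'Miscoded';
RUN;

DATA _NULL_;
   LENGTH DX $3;
   DO DX = '123', '1.3', 'X';
      RESULT = PUT(DX, $DXCHK.);
      PUT DX= RESULT=;
   END;
RUN;
/* '123' and '1.3' both format as Valid; 'X' formats as Miscoded */
```

If values like 1.3 must be caught, the format check can be combined with a character-level test such as VERIFY.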
Using PROC RANK to List the Highest and Lowest n% of the Data

    %MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);
       ***Compute number of groups for PROC RANK;
       %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);
       ***Value of the highest GROUP from PROC RANK, equal to the
          number of groups - 1;
       %LET TOP = %EVAL(&GRP - 1);

       PROC FORMAT;
          VALUE RNK 0='Low' &TOP='High';
       RUN;

       PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;
          VAR &VAR;
          RANKS RANGE;
       RUN;

       ***Sort and keep top and bottom n%;
       PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));
          BY &VAR;
       RUN;
       ***Produce the report;
       PROC PRINT DATA=NEW;
          TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)";
          ID &IDVAR;
          VAR RANGE &VAR;
          FORMAT RANGE RNK.;
       RUN;

       PROC DATASETS LIBRARY=WORK NOLIST;
          DELETE NEW;
       RUN;
       QUIT;

    %MEND HI_LOW_P;
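The macro turns the requested percentage into a number of rank groups: 100 divided by the percentage, rounded down, with the highest group numbered one less than the group count because PROC RANK numbers groups from 0. A minimal open-code sketch of just that arithmetic, using the 10% value from the macro call shown in this section:

```sas
/* Sketch: the group arithmetic used inside HI_LOW_P, run in open code. */
%LET PERCENT = 10;
%LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);   /* number of groups       */
%LET TOP = %EVAL(&GRP - 1);                   /* highest group number   */
%PUT PERCENT=&PERCENT GRP=&GRP TOP=&TOP;
/* Log shows: PERCENT=10 GRP=10 TOP=9 */
```

%SYSEVALF is needed because %EVAL does integer arithmetic only; with a percentage such as 15, the FLOOR conversion gives 6 groups, so each tail holds roughly 16.7% of the data rather than exactly 15%.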
    %HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)

    Upper and Lower 10% Values for SBP

    PATNO    RANGE    SBP
    020      Low       20
    023      Low       34
    011      High     300
    321      High     400
Detecting Outliers Based on the Standard Deviation

    ***Output mean and standard deviations to a data set;
    proc means data=clean.patients noprint;
       var hr;
       output out=means(drop=_type_ _freq_)
              mean=m_hr
              std=s_hr;
    run;

Data set MEANS contains one observation:

    Listing of Data Set MEANS

       m_hr       s_hr
    104.871    153.026
    %let n_sd = 2;
    ***With normally distributed data, about 5% of values lie more than
       two standard deviations from the mean;

    data _null_;
       file print;
       title "Statistics for Numeric Variables";
       set clean.patients;
       if _n_ = 1 then set means;
       if hr lt m_hr - &n_sd*s_hr and hr ne . or
          hr gt m_hr + &n_sd*s_hr then put patno= hr=;
    run;
    Statistics for Numeric Variables
    PATNO=321 HR=900
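Only one value is reported because the extreme values themselves inflate the standard deviation. A quick sketch of the cutoffs, using the mean and standard deviation listed for data set MEANS (104.871 and 153.026) and two standard deviations:

```sas
/* Sketch: cutoffs implied by the untrimmed mean and standard deviation. */
/* The two statistics are copied from the MEANS listing in this section. */
data _null_;
   m_hr = 104.871;
   s_hr = 153.026;
   n_sd = 2;
   low  = m_hr - n_sd*s_hr;   /* about -201.18 */
   high = m_hr + n_sd*s_hr;   /* about  410.92 */
   put low= high=;
run;
```

With a lower cutoff below zero and an upper cutoff above 400, heart rates such as 10, 22, 208, and 210 pass the check, and only 900 is flagged.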
Detecting Outliers Based on Trimmed Data

A trimmed mean is a mean computed by first removing some of the highest and lowest values before doing the calculation.

    proc rank data=clean.patients out=tmp groups=4;
       var hr;
       ranks r_hr;
    run;

    proc means data=tmp noprint;
       where r_hr in (1,2); ***The middle 50%;
       var hr;
       output out=means(drop=_type_ _freq_)
              mean=m_hr
              std=s_hr;
    run;
    %let n_sd = 2;

    ***The factor 2.63 scales the standard deviation of the middle 50%
       up to an estimate of the untrimmed standard deviation, assuming
       normally distributed data;
    data _null_;
       title "Outliers Based on Trimmed Data";
       file print;
       set clean.patients;
       if _n_ = 1 then set means;
       if hr lt m_hr - &n_sd*2.63*s_hr and hr ne . or
          hr gt m_hr + &n_sd*2.63*s_hr then put patno= hr=;
    run;
    Outliers Based on Trimmed Data
    PATNO=008 HR=210
    PATNO=014 HR=22
    PATNO=017 HR=208
    PATNO=321 HR=900
    PATNO=020 HR=10
    PATNO=023 HR=22
Program to Trim the Top and Bottom 5% of the Data

    proc rank data=clean.patients out=tmp groups=20;
       var hr;
       ranks r_hr;
    run;

    proc means data=tmp noprint;
       where r_hr not in (0,19); *The middle 90%;
       var hr;
       output out=means(drop=_type_ _freq_)
              mean=m_hr
              std=s_hr;
    run;
Defining Interquartile Range

[Figure: box plot of Diastolic Blood Pressure (DBP). Labels: Median, Q1 (lower hinge), Q3 (upper hinge), IQR, and 1.5 x IQR. Values printed at the box: 82, 74, and 100. Points marked as outliers: 8 and 20 at the low end, 180 and 200 at the high end.]
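The figure's rule can be worked through numerically. The sketch below assumes the hinge values on the slide are Q1 = 74 and Q3 = 100; that assignment is a reading of the figure, not something the text states. The DBP values tested are the extremes reported by PROC UNIVARIATE earlier in this section.

```sas
/* Sketch: the 1.5 x IQR rule applied to a few DBP values.            */
/* Q1 = 74 and Q3 = 100 are ASSUMED readings of the box plot figure.  */
DATA _NULL_;
   Q1 = 74;
   Q3 = 100;
   IQR  = Q3 - Q1;          /* 26  */
   LOW  = Q1 - 1.5*IQR;     /* 35  */
   HIGH = Q3 + 1.5*IQR;     /* 139 */
   PUT IQR= LOW= HIGH=;
   DO DBP = 8, 20, 64, 120, 180, 200;
      IF DBP LT LOW OR DBP GT HIGH THEN PUT "Outlier: " DBP=;
   END;
RUN;
/* Flags 8, 20, 180, and 200, the same points the figure marks as outliers */
```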
Outliers Based on the Interquartile Range

    %MACRO INTER_Q(DSN,VAR,IDVAR,N_IQR);
       PROC MEANS DATA=&DSN NOPRINT;
          VAR &VAR;
          OUTPUT OUT=TMP Q3=UPPER Q1=LOWER QRANGE=IQR;
       RUN;

       DATA _NULL_;
          TITLE "Outliers Based on &N_IQR Interquartile Ranges";
          FILE PRINT;
          SET &DSN;
          IF _N_ = 1 THEN SET TMP;
          IF &VAR LT LOWER - &N_IQR*IQR AND &VAR NE . OR
             &VAR GT UPPER + &N_IQR*IQR THEN PUT &IDVAR= &VAR=;
       RUN;

       PROC DATASETS LIBRARY=WORK NOLIST;
          DELETE TMP;
       RUN;
       QUIT;

    %MEND INTER_Q;
    %INTER_Q(CLEAN.PATIENTS,SBP,PATNO,2);

    Outliers Based on 2 Interquartile Ranges
    PATNO=011 SBP=300
    PATNO=321 SBP=400
Using Perl Regular Expressions for Data Cleaning
Some Perl Regular Expression Examples

    Regular Expression   Matches
    /cat/                the letters "cat"
    /cat*/               the letters "ca" followed by zero or more "t"s
    /cat+/               the letters "ca" followed by one or more "t"s
    /c[aeiou]t/          a "c" followed by a vowel followed by the letter "t"
    /\d\d/               any two digits
    /\d\d+/              two or more digits
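These patterns can be tried directly with PRXMATCH, which returns the position where a match starts, or 0 when there is none. A minimal sketch, assuming SAS 9.1 or later so that the pattern can be passed as a string; the test strings are illustrative.

```sas
/* Sketch: trying a few of the patterns with PRXMATCH.   */
/* The strings searched are made-up examples.            */
DATA _NULL_;
   P1 = PRXMATCH('/cat*/', 'a car');        /* 3: "ca" plus zero "t"s   */
   P2 = PRXMATCH('/cat+/', 'a car');        /* 0: needs at least one t  */
   P3 = PRXMATCH('/c[aeiou]t/', 'a cot');   /* 3                        */
   P4 = PRXMATCH('/\d\d+/', 'room 7');      /* 0: only one digit        */
   P5 = PRXMATCH('/\d\d+/', 'room 1234');   /* 6                        */
   PUT P1= P2= P3= P4= P5=;
RUN;
```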
PRXPARSE Syntax

    RE = PRXPARSE("expression");

RE is the return code; "expression" is the Perl regular expression.

Examples

    RETURN = PRXPARSE("/cat/");
    RE = PRXPARSE("/\d\d+/");
PRXMATCH Syntax

    POS = PRXMATCH(return,string);

POS is the position of the beginning of the pattern; if the pattern is not found, PRXMATCH returns zero. The first argument, return, is the return code from the PRXPARSE function, and string is the text string to search.

Examples

    POS = PRXMATCH(RE,STRING);

    RETURN = PRXPARSE("/cat/");
    P = PRXMATCH(RETURN,"This is a cat");

Value of P is 11
A Simple Example: Locating a SS Number

    DATA FIND_SS;
       IF _N_ = 1 THEN RETURN = PRXPARSE("/\d\d\d-\d\d-\d\d\d\d/");
       RETAIN RETURN;
       INPUT STRING $30.;
       POSITION = PRXMATCH(RETURN,STRING);
       IF POSITION GT 0 THEN OUTPUT;
    DATALINES;
    none on this line
    yes! 123-45-6789 is one
    two 111-22-3333 444-55-6666
    ;

    RETURN    STRING                         POSITION
         1    yes! 123-45-6789 is one               6
         1    two 111-22-3333 444-55-6666           5
Using PRXMATCH without using PRXPARSE (version 9.1)

    DATA FIND_SS;
       INPUT STRING $30.;
       POSITION = PRXMATCH("/\d\d\d-\d\d-\d{4}/",STRING);
       IF POSITION GT 0 THEN OUTPUT;
    DATALINES;
    none on this line
    yes! 123-45-6789 is one
    two 111-22-3333 444-55-6666
    ;

    STRING                         POSITION
    yes! 123-45-6789 is one               6
    two 111-22-3333 444-55-6666           5
Using Perl Regular Expressions for Data Cleaning

    DATA BAD_DATA;
       SET CLEAN.PATIENTS;
       ***Valid DX: one digit and two blanks, two digits and one blank,
          or three digits;
       IF PRXMATCH("/\d  |\d\d |\d{3}/",DX) EQ 0 AND NOT MISSING(DX);
    RUN;

    Listing of data set BAD_DATA

    PATNO    DX
    002      X
    002      X
    987      1.3
    986      1 5
Some Regular Expression Solutions

    DATA BAD_OBS;
       LENGTH ID $ 5;
       INPUT ID @@;
       *Valid ID's are X, Y, or Z followed by one or more digits;
       IF NOT PRXMATCH("/^[XYZ]\d+/",ID);
    DATALINES;
    X12 C334 Y777 78Z 999
    ;

    Listing of data set BAD_OBS

    ID
    C334
    78Z
    999
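A caution on alternation: in a Perl regular expression the `|` operator binds more loosely than anchors and sequences, so a pattern written as `/^X|Y|Z\d+/` reads as three alternatives: an X at the start, a Y anywhere, or a Z followed by digits. A character class states "X, Y, or Z followed by one or more digits" more directly. The sketch below compares the two forms; the ID values are illustrative.

```sas
/* Sketch: alternation precedence versus a character class. */
/* The ID values are made-up examples.                      */
DATA _NULL_;
   LENGTH ID $5;
   DO ID = 'X12', 'AY', 'X', 'Z9';
      LOOSE  = PRXMATCH('/^X|Y|Z\d+/', ID);   /* three separate alternatives */
      STRICT = PRXMATCH('/^[XYZ]\d+/', ID);   /* X, Y, or Z, then digits     */
      PUT ID= LOOSE= STRICT=;
   END;
RUN;
/* X12: LOOSE=1 STRICT=1    AY: LOOSE=2 STRICT=0 */
/* X:   LOOSE=1 STRICT=0    Z9: LOOSE=1 STRICT=1 */
```

Both forms give the same result for the five IDs in the example data, which is why the difference is easy to miss. Neither pattern anchors the end of the value, so an ID such as X12A would still pass.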
Penn State SAS Users Group
Open to all SAS users in the central PA area
Visit our website for more information and to sign up for our listserv.
http://www.pop.psu.edu/help/cacpri/psusug/