Dropping variables from a large SAS® data set when all ... · Dropping variables from a large...

13
1 Dropping variables from a large SAS® data set when all their values are missing Gwen Babcock, New York State Department of Health, Albany, NY ABSTRACT Several methods have been proposed to drop variables whose values are all missing from a SAS data set. These methods typically use PROC SQL, macro variables, and/or arrays, all of which may be slow or fail when applied to large data sets. One method uses hash tables and CALL EXECUTE, but it works only on character variables. Therefore, I have developed an alternative method which can be applied to data sets of any size. The strategy is to use PROC MEANS and PROC FREQ to identify variables whose values are all missing. These variables are then dropped using DATA steps generated by CALL EX- ECUTE. This method, which uses intermediate level coding techniques, will complement previously proposed methods be- cause it works well on large data sets, and on both character and numeric variables. An example is given using US Census American Community Survey data with SAS 9.3 under Windows XP. INTRODUCTION With relatively small data sets, you can often run PROC FREQ or some other procedure which provides descriptive statistics, visually inspect the results, and use the DROP data set option to remove variables whose values are all missing (Zdeb, 2011). However, for large data sets, this method is very tedious and time consuming. Therefore, two macros have been pro- posed to remove variables from a SAS data set when all its values are missing. The first, the %DROPMISS macro (Sridhar- ma , 2006 and 2010) fails with large data sets. The second, the %drop_blank_vars macro proposed by Tanimoto (2010), works only on character variables. Both use advanced coding techniques: Sridharma uses macros, and Tanimoto uses mac- ros and hash tables. The method proposed here uses intermediate coding techniques, is fast, and works on the largest data sets. In addition, with minor modifications, this method can be used to remove coded missing values, such as 99999999 or - 1, etc. PREPARATION To prepare to drop all variables with missing values, I first specified the libraries and data sets to use. I wished to preserve the label and compression of the original data set, so I obtained that information using PROC CONTENTS and placed it in macro variables using CALL SYMPUTX in a DATA step. (For a discussion of alternative ways to access table metadata, see Carpenter (2013)). I also make a copy of the original data set to use for manipulation, so that the original is preserved in case of need. /*specify any libraries used*/ libname sas 'E:\census\ACS\2007_2011\SAS\sasblock'; /*specify the input and output data sets*/ %let dsin=sas.acs2010_5yr_blkgrp; %let dsout=sas.acs2010_5yr_blkgrp2; /*specify an alternative missing value for numeric variables. this value should be a number*/ %let altnumericmissvalue=-1; proc contents data=&dsin noprint out=mycontents (keep=memlabel compress); run; data _null_;set mycontents(obs=1); call symputx("mylabel",memlabel); call symputx("mycompress",compress); run; data &dsout (compress &mycompress.); set &dsin; run; FINDING NUMERIC VARIABLES WITH ONLY MISSING VALUES Once you have finished the preparation, You can find the number of non-missing values for all numeric variables by using the “N” statistic in PROC MEANS.

Transcript of Dropping variables from a large SAS® data set when all ... · Dropping variables from a large...

Page 1: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

1

Dropping variables from a large SAS® data set when all their values aremissing

Gwen Babcock, New York State Department of Health, Albany, NY

ABSTRACTSeveral methods have been proposed to drop variables whose values are all missing from a SAS data set. These methodstypically use PROC SQL, macro variables, and/or arrays, all of which may be slow or fail when applied to large data sets.One method uses hash tables and CALL EXECUTE, but it works only on character variables. Therefore, I have developed analternative method which can be applied to data sets of any size. The strategy is to use PROC MEANS and PROC FREQ toidentify variables whose values are all missing. These variables are then dropped using DATA steps generated by CALL EX-ECUTE. This method, which uses intermediate level coding techniques, will complement previously proposed methods be-cause it works well on large data sets, and on both character and numeric variables. An example is given using US CensusAmerican Community Survey data with SAS 9.3 under Windows XP.

INTRODUCTIONWith relatively small data sets, you can often run PROC FREQ or some other procedure which provides descriptive statistics,visually inspect the results, and use the DROP data set option to remove variables whose values are all missing (Zdeb,2011). However, for large data sets, this method is very tedious and time consuming. Therefore, two macros have been pro-posed to remove variables from a SAS data set when all its values are missing. The first, the %DROPMISS macro (Sridhar-ma , 2006 and 2010) fails with large data sets. The second, the %drop_blank_vars macro proposed by Tanimoto (2010),works only on character variables. Both use advanced coding techniques: Sridharma uses macros, and Tanimoto uses mac-ros and hash tables. The method proposed here uses intermediate coding techniques, is fast, and works on the largest datasets. In addition, with minor modifications, this method can be used to remove coded missing values, such as 99999999 or -1, etc.

PREPARATIONTo prepare to drop all variables with missing values, I first specified the libraries and data sets to use. I wished to preservethe label and compression of the original data set, so I obtained that information using PROC CONTENTS and placed it inmacro variables using CALL SYMPUTX in a DATA step. (For a discussion of alternative ways to access table metadata, seeCarpenter (2013)). I also make a copy of the original data set to use for manipulation, so that the original is preserved in caseof need.

/*specify any libraries used*/libname sas 'E:\census\ACS\2007_2011\SAS\sasblock';

/*specify the input and output data sets*/%let dsin=sas.acs2010_5yr_blkgrp;%let dsout=sas.acs2010_5yr_blkgrp2;

/*specify an alternative missing value for numeric variables.this value should be a number*/%let altnumericmissvalue=-1;

proc contents data=&dsin noprint out=mycontents (keep=memlabel compress);run;

data _null_;set mycontents(obs=1);call symputx("mylabel",memlabel);call symputx("mycompress",compress);run;

data &dsout (compress &mycompress.);set &dsin;run;

FINDING NUMERIC VARIABLES WITH ONLY MISSING VALUESOnce you have finished the preparation, You can find the number of non-missing values for all numeric variables by using the“N” statistic in PROC MEANS.

Page 2: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

2

proc means data=&dsout noprint;output out=checknumeric(drop=_:) n=;/*get the number of NON-MISSING for all numeric varia-bles*/run; /*one observation, many variables*/

The data set output from PROC MEANS has only one observation. It has one variable for each numeric variable in the inputdata set. I transpose the data and select only those variables which have no values.

proc transpose data=checknumeric out=checknumeric2 (drop=_label_ rename=(col1=numhere_name_=vname));run;

data dropnumeric;set checknumeric2 (where=(numhere=0));run;

Obs vname numhere

1 B01001Ae1 0

2 B01001Ae2 0

3 B01001Ae3 0

4 B01001Ae4 0

5 B01001Ae5 0

6 B01001Ae6 0

7 B01001Ae7 0

8 B01001Ae8 0

9 B01001Ae9 0

10 B01001Ae10 0PROC MEANS is very fast. However, for datasets containing hundreds of variables and millions of observations, finding vari-ables with all missing values can be done more quickly by first testing a subset of the data. For example, the OBS=10000data set option can be used to test the first 10,000 observations. Only the variables whose values are all missing in the sub-set would be tested in the entire dataset. Appendix A shows the code for this approach.

FINDING NUMERIC VARIABLES WITH ONLY MISSING CODED VALUESIf the missing values are coded, a slight variation on the theme is needed. The assumption behind this code is that if themaximum value and minimum value are both equal to the coded missing value, then all of the values for the variable areequal to the coded missing value. First, I use PROC MEANS to get the minimum and maximum values of all the numericvariables, transpose them as before, and then merge them. Then I select those whose minimum and maximum values areboth equal to the coded missing value.

proc means data=&dsout noprint;output out=biggest(drop=_:) max=;run;proc means data=&dsout noprint;output out=smallest(drop=_:) min=;run;

Page 3: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

3

proc transpose data=biggest out=biggest2 (drop=_label_ rename=(col1=greatest _name_=vname));run;proc transpose data=smallest out=smallest2 (drop=_label_ rename=(col1=least _name_=vname));run;

proc sort data=biggest2; by vname; run;proc sort data=smallest2; by vname; run;

data dropnumericalt;merge biggest2 smallest2;by vname;if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;run;

Obs vname greatest least

1 B00001m1 -1 -1

2 B00002m1 -1 -1

3 B99011m1 -1 -1

4 B99011m2 -1 -1

5 B99011m3 -1 -1

6 B99012m1 -1 -1

7 B99012m2 -1 -1

8 B99012m3 -1 -1

9 B99021m1 -1 -1

10 B99021m2 -1 -1

FINDING CHARACTER VARIABLES WITH ONLY MISSING VALUESSince PROC MEANS can only be used with numeric variables, I use PROC FREQ to find character variables whose valuesare all missing. The NLEVELS option produces a list of variables with the number of distinct missing and non-missing values.If the number of distinct non-missing values is zero, then the variable should be dropped. I used the ODS OUTPUT state-ment to get the list of variables in a data set. I used _CHAR_ as a shortcut to list all of the character variables.

ods output NLevels=checkchar;ods html close; /*suppress default output, because I only want a data set*/proc freq data=&dsout nlevels;tables _char_ /noprint;run;ods html;ods output close;

Page 4: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

4

The resulting data set looks like this:

Obs TableVar TableVarLabel NLevels NMissLevels NNonMissLevels

1 FILEID File Identification 1 0 1

2 STUSAB State Postal Abbreviation 1 0 1

3 SUMLEVEL Summary Level 1 0 1

4 COMPONENT geographic Component 1 0 1

5 LOGRECNO Logical Record Number 15463 0 15463

6 US US 1 1 0

7 REGION Region 1 1 0

8 DIVISION Division 1 1 0

9 STATECE State (Census Code) 1 1 0

10 STATE State (FIPS Code) 1 0 1

From this data set, I select only the variables to be dropped:

data dropchar (rename=(TableVar=vname));set checkchar (where=(NNonMissLevels=0));run;

I could have used PROC FREQ with TABLES _ALL_ to get the missing values for all variables, including numeric ones, but itis slower than PROC MEANS. If your data set is small and you are not concerned about coded missing values, or you havefew numeric variables, then using PROC FREQ alone may be more convenient. However, if your dataset has millions of ob-servations, it would be faster to first use both PROC MEANS (for numeric variables) and PROC FREQ (for character varia-bles) on a subset of the observations. Only those variables whose values are all missing in that subset would be used in aPROC MEANS/PROC FREQ of the entire dataset. That would require more complicated code (see Appendix A). With any ofthese approaches, I could also choose to drop character variables with only one distinct value; the information they containmay be more appropriately placed in the metadata rather than the data set itself.

COMBINING AND PARSING THE LIST OF VARIABLES TO DROPA simple SET statement in a DATA step is used to combine the three lists of variables to drop. The new list contains oneobservation per variable to drop. However, for maximum speed, I want to drop as many variables as possible in one DATAstep. Therefore, the list of variables must be converted to the minimum number observations that can hold the list of varia-bles. Since the maximum length of a character variable is 32,767, and the maximum length of a variable name is 32, a char-acter variable should be able to hold a list of at least 992 SAS variable names. It may be able to hold more if the variablenames are shorter.

First, I combine the three lists of variables. The result is one data set containing one observation for each variable whosevalues are all missing or all the specified coded value.

data allvarstodrop;format vname $32.; /*maximum length of SAS variable name. Use this statement to make surethe names are not truncated*/set dropchar(keep=vname) dropnumeric (drop=numhere) dropnumericalt (drop=greatest least) ;varsize=length(vname)+1;run;

Next, I joined the variable names into lists using the CATX function. When the list reaches the maximum size of a SAS char-acter variable, or the last variable name is read, an observation is output.

Page 5: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

5

data varstodroplist (keep=varlist varsizesum);format varlist $32767.; /*this is the maximum size of a character variable*/retain varsizesum varlist;set allvarstodrop end=eof;varsizesum=sum(of varsize varsizesum);if varsizesum>32767 then do;

output;putlog vname varsizesum;varsizesum=varsize; varlist=vname; /*reset*/

end;else varlist=catx(' ',varlist,vname);if eof=1 then output;run;

DROPPING THE VARIABLESCALL EXECUTE (Whitlock, 1997) is used to generate the code that drops the variables with missing values. For each obse r-vation in the varstodroplist data set, a DATA step is generated which drops a set of variables. The label and compression ofthe original data set are used for the new data set.

data _null_;set varstodroplist;call execute("data &dsout(label='&mylabel' compress=&mycompress.);");call execute("set &dsout(drop=");call execute(varlist);call execute(");");call execute("run;");run;

Here is an example of the code generated by one iteration of this DATA step, with the list of variables to drop shortened forclarity. The full code as written into the SAS log is shown in Appendix B.

data sas.acs2011_5yr_blkgrp2(label='2007-2011 US Census American Survey estimates and marginsof error for NYS block groups, processed by gdb02' compress=BINARY);set sas.acs2011_5yr_blkgrp2(drop=B25129m31 B25129m32 B25129m33 B25129m34 B25129m35 B25129m36…);run;

CONCLUSIONSThis method for dropping variables whose values are all missing is fast and works on data sets with large numbers of varia-bles. It complements other methods. Since it does not use macros or hash tables, the code is less complex, and easier touse and debug. It can also be used to remove numeric variables whose values have all codes for missing, such as -1 or999999.

REFERENCES

Carpenter, Arthur L. (2013) “How Do I…?” There is more than one way to solve that problem; Why continuing to learn is soimportant. Proceedings of the SAS Global Forum 2013. 029-2013 p.10-12.

Page 6: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

6

Sridharma, S (2010) Dropping Automatically Variable with Only Missing Values. Proceedings of the SAS Global Forum2010. 048-2010.

Sridharma, S. (2006) How to Reduce the Disk Space Required by a SAS Data Set. NESUG 2006 Proceedings.

SAS Instutute, Inc. Sample 24622: Drop variables from a SAS data set when all its values are missing.http://support.sas.com/kb/24/622.html Accessed December 30, 2011.

Tanimoto, P. (2010) Meta-Programming in SAS with DATA Step. MWSUG Proceedings. 2010-124.

Whitlock, HI (1997) CALL EXECUTE: How and Why. Proceedings of the Twenty-Second Annual SAS® Users Group In-ternational Conference.

Xu, W. (2007) Identify and Remove Any Variables, Character or Numeric, That Have Only Missing Values . NESUG 2007Proceedings

Zdeb (2011) An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step. NESUG 2011 Proceedings.

ACKNOWLEDGMENTSThanks to Steve Forand, Thomas Talbot, and the New York State Department of Health, Center for Environmental Health forsupporting this work. Thanks to Mark H. Keintz for helping make the code faster.

CONTACT INFORMATIONYour comments and questions are valued and encouraged. Contact the author at:

Gwen D LaSelvaNew York State Department of HealthBureau of Environmental and Occupational EpidemiologyEmpire State Plaza-Corning Tower, Room 1203Albany, NY 12237Work Phone: 518-402-7785Fax: 518-402-7959Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Inst itute, Inc.in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

APPENDIX A: FASTER VERSION OF CODE USING PRE-TESTINGThis code is more complex, but faster than that discussed previously, especially for datasets containing hundreds of variablesand millions of observations. The general approach is to pre-test a subset of observations using PROC MEANS or PROCFREQ to determine which variables contain only missing values in the subset. These variables are then tested in the entiredataset, and dropped if they contain only missing values.

/***************************************************************the purpose of this program is to drop variables with*missing values from a SAS dataset.*This code is designed to have a similar function to*The DROPMISS macro of Selvaratname Sridharma*see "Dropping Automatically Variables with Only Missing Values"* SAS Global Forum 2010*But the DROPMISS macro fails sometimes, so this is an alternative*Note: this program is more complex than the 'drop missing variables' program* but will be faster, especially on datasets with large numbers of observations* it was tested on NYS outpatient data with about 653 variables and over19 million observations*programmed by Gwen LaSelva*inputs:

library namename of input SAS datasetname of output SAS dataset

Page 7: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

7

alternative missing value for numeric variables, for example -1 or 999999sample size for preliminary determination of which variables to drop

*outputs: SAS dataset********************************************************************//*specify any libraries used*/libname sas 'E:\census\ACS\2007_2011\SASblockout';libname op "P:\Sections\EPHTdata\SPARCSdeidentified\Outpatient";

/*specify the input and output datasets*/%let dsin=op.outpatient_s_2011_prime;%let dsout=sas.want;

/*specify an alternative missing value for numeric variables.this value should be a number*/%let altnumericmissvalue=0;

/*specify the number of observations to sample. The size shouldbe a fraction of the total number of observations. It should beyour best guess of the minimum number of observations needed to findnon-missing values in all variables whose values are not all missing.Changing this will affect the speed of the program*/%let samplesize=10000;

/*put the time in the log*/data _null_;temptime=put(time(),hhmm5.);putlog "start time=" temptime;run;

/*shouldn't need to modify below this line*//*STEP 1: PREPARATION*//*get the label and compression of the input dataset and put them into a macro vari-ables,use them at the end so the new dataset has the same label and compression asthe original*/proc contents data=&dsin noprint out=mycontents (keep=memlabel compress);run;data _null_;set mycontents(obs=1);call symputx("mylabel",memlabel);call symputx("mycompress",compress);run;

/*first, make output dataset to work with, so we don't accidentally messup the original dataset if something goes wrong*/data &dsout (compress= &mycompress.);set &dsin;run;

/*STEP 2: FIND ALL NUMERIC VARIABLES WITH ALL MISSING VALUES*//*proc means with the N statistic should count number of non-missing values*//*start with a subset of observations*/proc means data=&dsout (obs=&samplesize.) noprint;output out=sampledropnumeric(drop=_:) n=;/*get the number of NON-MISSING for all nu-meric variables*/run;proc transpose data=sampledropnumeric out=sampledropnumeric2 (rename=(col1=numhere_name_=vname));

Page 8: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

8

run;data maybedropnumeric (keep=vname varsize);set sampledropnumeric2 (where=(numhere=0));varsize=length(vname)+1;run;

/*test the variables who are all missing in the first subset of observationsin the entire dataset*/data numvarstotestlist (keep=varlist varsizesum);format varlist $32767.; /*this is the maximum size of a character variable*/retain varsizesum varlist;set maybedropnumeric end=eof;varsizesum=sum(of varsize varsizesum);if varsizesum>32767 then do;

output;putlog vname varsizesum;varsizesum=varsize; varlist=vname; /*reset*/

end;else varlist=catx(' ',varlist,vname);if eof=1 then output;run;

data _null_;set numvarstotestlist ;call execute("proc means data=&dsout noprint;");call execute("output out=checknumeric1(drop=_:) n=;");call execute("var");call execute(varlist);call execute(";run;");if _n_=1 then call execute("data checknumeric; set checknumeric1; run;");else call execute("data checknumeric; merge checknumeric checknumeric1; run;");run;

proc transpose data=checknumeric out=checknumeric2 (drop=_label_ re-name=(col1=numhere _name_=vname));run;data dropnumeric;set checknumeric2 (where=(numhere=0));run;

/*Use PROC FREQ to get character variables whose values are all missingit is slower than PROC MEANS*/

/*start with a subset of observations*/ods output NLevels=maybedropchar;ods html close; /*suppress default output, because we only want a dataset*/proc freq data=&dsout (obs=&samplesize.) nlevels;tables _char_ /noprint;run;ods html;ods output close;

data maybedropchar2 (rename=(TableVar=vname));set maybedropchar (where=(NNonMissLevels=0));varsize=length(TableVar)+1;run;

Page 9: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

9

/*test the variables who are all missing in the first subset of observationsin the entire dataset*/data charvarstotestlist (keep=varlist varsizesum);format varlist $32767.; /*this is the maximum size of a character variable*/retain varsizesum varlist;set maybedropchar2 end=eof;varsizesum=sum(of varsize varsizesum);if varsizesum>32767 then do;

output;putlog vname varsizesum;varsizesum=varsize; varlist=vname; /*reset*/

end;else varlist=catx(' ',varlist,vname);if eof=1 then output;run;

data _null_;set charvarstotestlist ;call execute("ods output NLevels=checkchar1;ods html close;");call execute("proc freq data=&dsout nlevels; tables");call execute(varlist);call execute("/noprint; run; ods html; ods output close;");if _n_=1 then call execute("data dropchar; set checkchar1 (rename=(TableVar=vname)where=(NNonMissLevels=0)); run;");else call execute("data cropchar; set dropchar checkchar1 (rename=(TableVar=vname)where=(NNonMissLevels=0)); run;");run;/*You could use proc freq for all of the variables, but it would be slow*/

/*now, find out if any of the variables contain only the alternative missing value*//*let us assume that if the minimum and maximum values are both equal tothe alternative missing value, the variable should be dropped*//*start with a subset of observations*/proc means data=&dsout (obs=&samplesize.) noprint;output out=maybebiggest(drop=_:) max=;run;proc means data=&dsout (obs=&samplesize.) noprint;output out=maybesmallest(drop=_:) min=;run;

proc transpose data=maybebiggest out=maybebiggest2 (drop= rename=(col1=greatest_name_=vname));run;proc transpose data=maybesmallest out=maybesmallest2 (drop= rename=(col1=least_name_=vname));run;proc sort data=maybebiggest2; by vname; run;proc sort data=maybesmallest2; by vname; run;

data testnumericalt;merge maybebiggest2 maybesmallest2;by vname;varsize=length(vname)+1;if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;run;

Page 10: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

10

/*test the variables who are all the alternative missing value in the first subsetof observations in the entire dataset*/data numvarstotestlist (keep=varlist varsizesum);format varlist $32767.; /*this is the maximum size of a character variable*/retain varsizesum varlist;set testnumericalt end=eof;varsizesum=sum(of varsize varsizesum);if varsizesum>32767 then do;

output;putlog vname varsizesum;varsizesum=varsize; varlist=vname; /*reset*/

end;else varlist=catx(' ',varlist,vname);if eof=1 then output;run;

data _null_;set numvarstotestlist ;call execute("proc means data=&dsout noprint;");call execute("output out=biggest1(drop=_:) max=; var");call execute(varlist);call execute(";run;");call execute("proc means data=&dsout noprint;");call execute("output out=smallest1(drop=_:) min=; var");call execute(varlist);call execute(";run;");if _n_=1 then do;

call execute("data biggest; set biggest1; run;");call execute("data smallest; set smallest1; run;");

end;else do;

call execute("data smallest; merge smallest smallest1; run;");call execute("data biggest; merge biggest biggest1; run;");

end;run;

proc transpose data=biggest out=biggest2 (drop= rename=(col1=greatest_name_=vname));run;proc transpose data=smallest out=smallest2 (drop= rename=(col1=least _name_=vname));run;

proc sort data=biggest2; by vname; run;proc sort data=smallest2; by vname; run;

data dropnumericalt;merge biggest2 smallest2;by vname;if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;run;

Page 11: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

11

/*STEP 3: COMBINE LISTS OF CHARACTER AND NUMERIC VARIABLES TO DROP*//*for convience, combine the lists of character and numeric variables to drop*/data allvarstodrop;format vname $32.; /*maximum length of SAS variable name. Use this statement tomake sure the names are not truncated*/set dropchar(keep=vname) dropnumeric (drop=numhere) dropnumericalt (drop=greatestleast) ;varsize=length(vname)+1;run;*proc print data=allvarstodrop (obs=10);run;

/*STEP 4: DROP THE VARIABLES*//* create a character variable containing the list of variablesto be dropped. A character variable can be up to 32767 characters,and can therefore hold a list of at least 992 SAS variables, maybe more*//*if the sum of varsize is more than 32767 need to create additionalobservations in the dataset to hold the additional variables*/data varstodroplist (keep=varlist varsizesum);format varlist $32767.; /*this is the maximum size of a character variable*/retain varsizesum varlist;set allvarstodrop end=eof;varsizesum=sum(of varsize varsizesum);if varsizesum>32767 then do;

output;putlog vname varsizesum;varsizesum=varsize; varlist=vname; /*reset*/

end;else varlist=catx(' ',varlist,vname);if eof=1 then output;run;

/*use call execute to repeat the dropping for each set of variable names*/data _null_;set varstodroplist;call execute("data &dsout(label='&mylabel' compress=&mycompress.);");call execute("set &dsout(drop=");call execute(varlist);call execute(");");call execute("run;");run;/*put the end time in the log*/data _null_;temptime=put(time(),hhmm5.);putlog "end time=" temptime;run;

APPENDIX B: CODE GENERATED BY THE FINAL DATA STEP CALL EXECUTE STATEMENTSThis code was created by one iteration of the final DATA step, using CALL EXECUTE statements. It is this code which dropsthe variables that were previously determined to have only missing values. The code is shown as it appears in the SAS log.

1219 + data sas.acs2011_5yr_blkgrp2(label='2007-2011 US Census American Survey estimates andmargins of error for NYS block groups, processed by gdb02' compress=BINARY);1220 + set sas.acs2011_5yr_blkgrp2(drop=1221 + B25129m31 B25129m32 B25129m33 B25129m34 B25129m35 B25129m36 B25129m37 B25129m38B25129m39

Page 12: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

12

B25129m40 B25129m41 B25129m42 B25129m43 B25129m44 B25129m45 B25129m46 B25129m47 B25129m48B25129m49 B25129m50 B25129m51 B25129m52 B25129m53 B25129m54 B25129m551222 + B25129m56 B25129m57 B25129m58 B25129m59 B25129m60 B25129m61 B25129m62 B25129m63B25129m64B25129m65 B25129m66 B25129m67 B25129m68 B25129m69 B25129m70 B25129m71 B25129m72 B25129m73B25129m74 B25129m75 B25129m76 B25129m77 B25129m78 B25129m79 B25129m801223 + B25129m81 B25129m82 B25129m83 B25129m84 B25129m85 B25129m86 B25129m87 B26001e1B26001m1B98001e1 B98001e2 B98002e1 B98002e2 B98011e1 B98012e1 B98012e2 B98012e3 B98013e1 B98013e2B98013e3 B98013e4 B98013e5 B98013e6 B98013e7 B98014e1 B98021e1 B98021e21224 + B98021e3 B98021e4 B98021e5 B98021e6 B98021e7 B98021e8 B98021e9 B98022e1 B98022e2B98022e3B98022e4 B98022e5 B98022e6 B98022e7 B98022e8 B98022e9 B98022e10 B98031e1 B98032e1 B98001m1B98001m2 B98002m1 B98002m2 B98011m1 B98012m1 B98012m2 B98012m3 B98013m11225 + B98013m2 B98013m3 B98013m4 B98013m5 B98013m6 B98013m7 B98014m1 B98021m1 B98021m2B98021m3B98021m4 B98021m5 B98021m6 B98021m7 B98021m8 B98021m9 B98022m1 B98022m2 B98022m3 B98022m4B98022m5 B98022m6 B98022m7 B98022m8 B98022m9 B98022m10 B98031m1 B98032m11226 + B99052PRe1 B99052PRe2 B99052PRe3 B99052PRe4 B99052PRe5 B99052PRe6 B99052PRe7 B99085e1B99085e2 B99085e3 B99052PRm1 B99052PRm2 B99052PRm3 B99052PRm4 B99052PRm5 B99052PRm6B99052PRm7 B99085m1 B99085m2 B99085m3 B99086e1 B99086e2 B99086e3 B99087e1 B99087e21227 + B99087e3 B99087e4 B99087e5 B99088e1 B99088e2 B99088e3 B99088e4 B99088e5 B99089e1B99089e2B99089e3 B99086m1 B99086m2 B99086m3 B99087m1 B99087m2 B99087m3 B99087m4 B99087m5 B99088m1B99088m2 B99088m3 B99088m4 B99088m5 B99089m1 B99089m2 B99089m3 B99131e11228 + B99131e2 B99131e3 B99132e1 B99132e2 B99132e3 B99221e1 B99221e2 B99221e3 B99244e1B99244e2B99244e3 B99245e1 B99245e2 B99245e3 B99246e1 B99246e2 B99246e3 B992523e1 B992523e2 B992523e3B99131m1 B99131m2 B99131m3 B99132m1 B99132m2 B99132m3 B99221m1 B99221m21229 + B99221m3 B99244m1 B99244m2 B99244m3 B99245m1 B99245m2 B99245m3 B99246m1 B99246m2B99246m3B992523m1 B992523m2 B992523m3 B00001m1 B00002m1 B99011m1 B99011m2 B99011m3 B99012m1 B99012m2B99012m3 B99021m1 B99021m2 B99021m3 B99031m1 B99031m2 B99031m3 B99051m11230 + B99051m2 B99051m3 B99051m4 B99051m5 B99051m6 B99051m7 B99052m1 B99052m2 B99052m3B99052m4B99052m5 B99052m6 B99052m7 B99061m1 B99061m2 B99061m3 B99071m1 B99071m2 B99071m3 B99072m1B99072m2 B99072m3 B99072m4 B99072m5 B99072m6 B99072m7 B99080m1 B99080m21231 + B99080m3 B99081m1 B99081m2 B99081m3 B99081m4 B99081m5 B99082m1 B99082m2 B99082m3B99082m4B99082m5 B99083m1 B99083m2 B99083m3 B99083m4 B99083m5 B99084m1 B99084m2 B99084m3 B99084m4B99084m5 B99092m1 B99092m2 B99092m3 B99102m1 B99102m2 B99102m3 B99103m11232 + B99103m2 B99103m3 B99104m1 B99104m2 B99104m3 B99104m4 B99104m5 B99104m6 B99104m7B99121m1B99121m2 B99121m3 B99141m1 B99141m2 B99141m3 B99142m1 B99142m2 B99142m3 B99151m1 B99151m2B99151m3 B99161m1 B99161m2 B99161m3 B99162m1 B99162m2 B99162m3 B99162m41233 + B99162m5 B99162m6 B99162m7 B99163m1 B99163m2 B99163m3 B99163m4 B99163m5 B99171m1B99171m10B99171m11 B99171m12 B99171m13 B99171m14 B99171m15 B99171m2 B99171m3 B99171m4 B99171m5B99171m6B99171m7 B99171m8 B99171m9 B99172m1 B99172m10 B99172m11 B99172m121234 + B99172m13 B99172m14 B99172m15 B99172m2 B99172m3 B99172m4 B99172m5 B99172m6 B99172m7B99172m8 B99172m9 B99191m1 B99191m2 B99191m3 B99191m4 B99191m5 B99191m6 B99191m7 B99191m8B99192m1 B99192m2 B99192m3 B99192m4 B99192m5 B99192m6 B99192m7 B99192m8 B99193m1

Page 13: Dropping variables from a large SAS® data set when all ... · Dropping variables from a large SAS® data set when all their values are missing ... format vname $32.; ...

13

1235 + B99193m2 B99193m3 B99193m4 B99193m5 B99193m6 B99193m7 B99193m8 B99194m1 B99194m2B99194m3B99194m4 B99194m5 B99194m6 B99194m7 B99194m8 B99201m1 B99201m2 B99201m3 B99201m4 B99201m5B99201m6 B99201m7 B99201m8 B99211m1 B99211m2 B99211m3 B99212m1 B99212m21236 + B99212m3 B99231m1 B99231m2 B99231m3 B99232m1 B99232m2 B99232m3 B99233m1 B99233m2B99233m3B99233m4 B99233m5 B99234m1 B99234m2 B99234m3 B99234m4 B99234m5 B99241m1 B99241m2 B99241m3B99242m1 B99242m2 B99242m3 B99243m1 B99243m2 B99243m3 B992510m1 B992510m21237 + B992510m3 B992511m1 B992511m2 B992511m3 B992512m1 B992512m2 B992512m3 B992513m1B992513m2B992513m3 B992514m1 B992514m2 B992514m3 B992515m1 B992515m2 B992515m3 B992516m1 B992516m2B992516m3 B992518m1 B992518m2 B992518m3 B992519m1 B992519m2 B992519m31238 + B992520m1 B992520m2 B992520m3 B992521m1 B992521m2 B992521m3 B992522m1 B992522m2B992522m3B992522m4 B992522m5 B992522m6 B992522m7 B99252m1 B99252m2 B99252m3 B99253m1 B99253m2 B99253m3B99254m1 B99254m2 B99254m3 B99255m1 B99255m2 B99255m3 B99256m11239 + B99256m2 B99256m3 B99257m1 B99257m2 B99257m3 B99258m1 B99258m2 B99258m3 B99259m1B99259m21240 + );1241 + run;

NOTE: There were 15463 observations read from the data set SAS.ACS2011_5YR_BLKGRP2.NOTE: The data set SAS.ACS2011_5YR_BLKGRP2 has 15463 observations and 6462 variables.NOTE: Compressing data set SAS.ACS2011_5YR_BLKGRP2 decreased size by 73.58 percent.

Compressed is 4094 pages; un-compressed would require 15495 pages.NOTE: DATA statement used (Total process time):

real time 7.93 secondscpu time 7.28 seconds