
SAS Essentials I: SAS® Functions for a Better Functioning Community AnnMaria De Mars, The Julia Group, Santa Monica, CA

ABSTRACT

Employers want work done by experienced programmers. New programmers want to gain experience. Communities want, but often can't afford, more people to analyze data and to explain those analyses in plain language. One way to break out of that vicious circle of "can't get a job without experience, can't get experience without a job" is to take advantage of the many opportunities to use freely available data to help inform your community. The beauty of it is that using these data sets can provide you the experiences that will advance you from being a novice programmer. This paper uses open data to demonstrate how new SAS programmers can download, analyze and present results from large and complex data sets. In the process, examples are presented of the experiences programmers can gain in making appropriate design choices and applying a broad range of functions and procedures. The PUT, VVALUE and INDEX functions and the %LENGTH macro function can all be used for text processing. Analyses are presented demonstrating when to use which function and why. PROC SQL, PROC TRANSPOSE, PROC MEANS, PROC FREQ and macro programming are used in new and old ways to get a new look at data in the Trends in International Mathematics and Science Study to provide information teachers can use in their classrooms.

INTRODUCTION

Employers want work done by experienced programmers. New programmers want to gain experience. Communities can't afford the people they need to analyze data and to explain those analyses in plain language. One way to break out of that vicious circle of "can't get a job without experience, can't get experience without a job" is to take advantage of the many opportunities to use open data to help inform your community.

GETTING THE DATA - AND MORE

"Open data", that is, data freely available to anyone to use or publish, is available from an enormous range of government and non-profit sources. Because one of the causes I care deeply about is American education, one of my favorite sites for open data is the National Center for Education Statistics, which is where I obtained this data set, the Trends in International Mathematics and Science Study (TIMSS). The data from this study provide a lot of opportunities to put SAS through its paces because it is a complex sample, with multiple data sets and a large number of different formats. More sources for open data are discussed in SAS Essentials II: Better looking SAS for a better community (De Mars, 2011a).

Getting the data is a piece of cake. Just go to the NCES TIMSS page (http://nces.ed.gov/pubsearch/getpubcats.asp?sid=073) and click on "Data Products". Select the data you want to download. These examples use the 2007 public-use data file for grade 8. If you haven't worked on large scale studies, here is your chance to see how one can be organized. The TIMSS data are documented well. Along with six raw data files for each grade, you'll download six codebooks and six SAS programs. Notice that I said this is how a project can be organized. It's not the only way and, in my opinion, not always the most efficient way, but it is set up to make it fairly easy for the data to be used by someone other than the original research team.

HOW TO CREATE A DATA SET FOR ANALYSIS

Facing all those files, and all those programs with hundreds, sometimes thousands, of lines of code, can be daunting, and you might be tempted to quit before really getting started. To begin with, many of the SAS programs distributed with open data sets are, to put it politely, not best practices. Don't try to model your programs after these files. Seriously, don't. Many have an INPUT statement that is hundreds of lines long, with one variable per line. Comments, if they exist at all, are usually very short and not particularly helpful.

/* This is the program to read the data */

To top it off, not all new programmers use FILENAME statements. They may only be running reports against SAS data sets already created, or importing from Excel files. If this describes you, you may be frustrated at the start looking at a program that's several hundred lines long just to read in the data. As the Hitchhiker's Guide to the Galaxy wisely advises, Don't Panic. There are only three statements you absolutely need to change in the code you download: the LIBNAME, FILENAME and DATA statements.


Change the program you downloaded to be like this (assuming you’re using Windows):

LIBNAME lib 'C:\Users\AnnMaria\Documents\timss\sasdata';

The LIBNAME statement will write the data to whatever directory you specified between the quotation marks. This directory must already exist. SAS won’t create it for you.

FILENAME in1 'C:\Users\AnnMaria\Documents\timss\rawdata\g8_achieve07.txt';

The FILENAME statement gives, in quotes, the directory and file name of the raw data file you downloaded.

DATA lib.g8_achieve07 ;

The structure for the name of the new file you are creating is libref.data-set-name.

The libref must match what is in the LIBNAME statement. The libref is what experienced programmers call a library reference (also known as the directory where your file is located, but libref is shorter to say). You can name the data set any valid SAS name.

INFILE in1 ;

You shouldn't have to touch this statement, but just look at it and be sure that whatever comes after the INFILE is the same name that comes after FILENAME. This is a fileref, a fond nickname for file reference.

DEALING WITH FORMATS IN OPEN DATA

You could stop at this point and just run the program. While there are only three statements you have to change, I'd recommend you insert this statement as the first line in your program:

OPTIONS NOFMTERR ;

Most open data sets have a large number of formats. This only makes sense. Formats can provide information to the end users that only those involved in the original research project would know. Missing data codes are one common reason for the need for formats. For example, if you did not collect the data yourself you likely have no idea why so many people had a score of “8” on a scale of 1 to 5. Were they all just too stupid to realize that 8 doesn’t come between 1 and 5? Usually you’ll find out that there is a very logical reason, along the lines of the question was “How satisfied are you with your child’s school?” and the people who were assigned a score of 8 didn’t have any children. Because SAS doesn’t have pre-set formats for these questions, the research staff writes their own formats for each data set.
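For instance, a format along the lines of the sketch below is the kind of thing a research team might write so that an out-of-range code such as 8 prints as something meaningful. The format name, labels and codes here are hypothetical, not the actual TIMSS formats:

PROC FORMAT ;
   VALUE satisf                        /* hypothetical 1-5 satisfaction scale */
         1 = 'Very dissatisfied'
         2 = 'Dissatisfied'
         3 = 'Neutral'
         4 = 'Satisfied'
         5 = 'Very satisfied'
         8 = 'No children in school'   /* missing-data code */
         ;
RUN ;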

I recommend using the OPTIONS and %INCLUDE statements to minimize your format problems. At a later time you'll probably be interested in any user-written formats, but this isn't the time. You don't want your program stopping and giving you errors because you don't have some special format. In theory, there shouldn't be any format errors, because if a program you download uses user-written formats, it should include all of the formats necessary, either in a FORMAT procedure at the beginning of the program or in a separate SAS program. Since that doesn't always happen, I recommend starting your program with an OPTIONS statement specifying not to quit when SAS encounters a format error.

Next, I cut and pasted the entire FORMAT procedure into a new file. I named it achieveformats.sas. Now to include these formats in any program, use this statement:

%INCLUDE 'C:\Users\AnnMaria\Documents\timss\achieveformats.sas' ;

What the %INCLUDE statement does is include code from another file. It will insert it into your program and run it just as if it was written right in that spot. You need to put the %INCLUDE at the beginning of your program so that the FORMAT procedure is run before the DATA step when those formats are going to be used.

The %INCLUDE statement accomplishes two purposes. First, moving the FORMAT procedure makes the program hundreds of lines shorter and thus easier to debug. Second, I may need those formats in many other cases, say, when I'm running statistics on these variables. I can just paste that one line at the top of any program where I need it.

Here you go. The first few lines of your program should look like the example below. Leave the other thousand lines the way they are in the file you downloaded and hit RUN.

OPTIONS NOFMTERR ;
%INCLUDE 'C:\Users\AnnMaria\Documents\timss\sascode\achieveformats.sas' ;
LIBNAME lib 'C:\Users\AnnMaria\Documents\timss\sasdata';
FILENAME in1 'C:\Users\AnnMaria\Documents\timss\rawdata\g8_achieve07.txt';
DATA lib.g8_achieve07 ;
INFILE in1 ;


Your mileage may vary: Using open data to become an expert programmer

Someone commented on my blog that he determined in an interview whether or not someone was a novice programmer by asking the person to tell him about some interesting design choices he or she had made recently and why. This is a really insightful comment and, since the purpose of the SAS Essentials section is to help you make that move to a more experienced programmer, throughout this paper are examples of different choices I made and the reasons you might choose differently.

I used a %INCLUDE for formats because I don't want to scroll through 400 extra lines of PROC FORMAT every time I'm looking at my log and program. Also, I will probably need those formats when I am running analyses against these data later, so the %INCLUDE is an easy way for me to include them without copying and pasting hundreds of lines. The reason I never permanently store formats is related to the nature of my business. I use a lot of public data sets for various purposes. I have clients from vocational rehabilitation programs to the Social Security Administration to nursing school faculty to you name it. Some of these data sets have 500 or more user-written formats. I wouldn't do it that way, but these are other people's data. If I work with 100 different data sets a year, the odds are great that at some point I will have the same format name conflict and write over some format I wanted, or end up using the format for Client A's project that belongs with Client B's; it just makes my environment more accident-prone. Even if they don't conflict, I don't like the idea of having 10,000 different formats. Now, if I worked on, say, Medicaid prescription drug data all day, day in and day out, it would make perfect sense for me to store formats for those data permanently.

Understanding the design choices you make and why is a really crucial distinction between novice and experienced programmers. For one thing, a more experienced programmer knows more than one way to accomplish a task. The second demonstration of experience comes from knowing when to apply each possibility. I once worked on a project for a small non-profit whose database was set up in a star schema, with client ID tables, activity tables and so on. The entire database didn't have more than a few thousand records in it. When I asked the programmer why he did it this way, he acted like I was the stupidest person on earth and told me that at the school where he got his certificate they told him that was the most efficient way for retrieval of records, took the least time, etc. All of which was true, except that he wasn't programming for Google but for a small non-profit, and it would have been much easier for them if he had made one long file with all of their variables. The whole thing wouldn't have taken more than a few megabytes of space, and then other people in the organization who were not very experienced programmers would not have had to bother with joining tables and selecting variables because everything would have been in one place. In fact, one of the reasons that programmer was eventually let go is that the management was very unhappy with how difficult the systems he designed were to use. Similarly, if I only had one or two formats defined, maybe it isn't worth the trouble of using the %INCLUDE statement. Remember Eagleson's Law:

Any Code Of Your Own That You Haven't Looked At For Six Or More Months, Might As Well Have Been Written By Someone Else.

Sticking in an extra statement that calls code stored somewhere external to your program is adding an extra layer of complexity. Whenever any decision makes your program more difficult to understand and debug, you always want to ask yourself whether that additional complexity is worth it. Here then, is one of the benefits of working with open data. There are an enormous number of data sets out there in a mind-boggling variety. Working with these gives you the opportunity through trial, error and experimentation to learn what works and why, all without the worry of deleting your company’s customer database by mistake and getting fired. If you totally screw up with your open data project, you can trash the whole thing, download the data again and start over.

DATA QUALITY 1.0, OR "THESE ARE NOT THE DATA YOU'RE LOOKING FOR"

Congratulations! You read in your data. Now what? Any time you receive any data set, the first step is to get some insight into your data. Now you may be happy you created that format file. The first thing anyone who gets a data set from anybody ought to do is:

PROC CONTENTS DATA = lib.g8_achieve07 ;


When scanning through here, you're looking for a few things:

• Is this actually the data set you are expecting to receive? Did you accidentally read the file on demographics instead of achievement? (Hey, nobody's perfect.) You'll know that after a minute or two of looking at the variable names and labels.

• Are these the variables you're looking for? How many variables are included? What are the variable names? Looking at the labels may give you a description of what the variable actually measures.

• Are there any format problems? You would think that people would have defined the answers to test questions as numeric, so you could add them up and get a total score, that is, 0 or 1 for wrong or right. If the items on a test are all defined as character, then you're going to have to recode those before you can add them to get a total score. What are the formats? Are there user-defined formats that you don't have? (A short sketch of one way to check variable types follows this list.)
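One quick way to spot items stored as character rather than numeric is to send the CONTENTS output to a data set and print only the character variables. This is a minimal sketch, not part of the distributed TIMSS programs; the only names assumed are the ones already used in this paper:

PROC CONTENTS DATA = lib.g8_achieve07 OUT = varlist NOPRINT ;
RUN ;
PROC PRINT DATA = varlist ;
   WHERE type = 2 ;                  /* in the CONTENTS output data set, 2 = character, 1 = numeric */
   VAR name type length format ;
RUN ;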

Data triage: Finding and fixing data disasters

Finding and fixing data errors is a two-step process. You need to do both of these, but if you don't do the first, which I refer to as "data triage", I just might come to your house and personally give you a good talking to. The first step uses the PRINT, MEANS and FREQ procedures to quickly identify any glaringly obvious problems with your data. Here's what to do and why you do it. Print out the first 10 records. Learn to stare at your data. I'm serious. If there's ever a meeting where the results are completely wrong because of some fundamental flaw in the data, you don't want to be the person the client is pointing at as he screams, "Didn't you even LOOK at these data?"

PROC PRINT DATA = lib.g8_achieve07 (OBS = 10) ;
   FORMAT _ALL_ ;

Don't forget the OBS = 10! You don't want to print out a data set of 357,000 records! If something is completely off, it should show up in the first 10 records. You may be thinking that you should print more than 10, but remember, many of these data sets have 500 or more variables, and there are more efficient ways to analyze your data than looking at >5,000 numbers. The FORMAT statement with the keyword _ALL_ and no format given causes SAS to print the unformatted values for all variables. Right now, you are just looking for glaring errors, like values entered for gender being "dog", "tiger" and "pickle". If an error was made in the INPUT statement and now all of the data are off by one column, you should spot that right away by looking at your first ten records. In the printout from the actual TIMSS student achievement data, the variables are multiple choice items with answers of A, B, C or D, or questions that require students to provide answers, such as "What is three cubed?" The unformatted values, numbers like 1, 2 and 8, don't fit what you would expect.


Next, let’s look at the output from the PROC PRINT with the formatted values, created from this statement:

PROC PRINT DATA = lib.g8_achieve07 (obs =10) ;

Obs  M042079     M042018           M042055     M042039     M042199     M042301A            M042301B
  1  C*          CORRECT RESPONSE  D           A*          D*          INCORRECT RESPONSE  INCORRECT RESPONSE
  2  NOT ADMIN.  NOT ADMIN.        NOT ADMIN.  NOT ADMIN.  NOT ADMIN.  NOT ADMIN.          NOT ADMIN.

Let's pause to think about what this output means. We can see that for some items, the data were entered as 1 = A, 2 = B and so on. If the multiple choice answer was correct, there is an asterisk next to it. If the item was not administered to a student, for some items (the multiple choice ones) a value of 8 was entered. For other items (the ones that required students to compute an answer), a score of 10 was entered for a correct response, 70 or 79 for an incorrect response and a 98 if it was not administered.

PROC MEANS DATA = lib.g8_achieve07 ;

At this point, you're only looking for one thing, and that is whether values are out of range, for example, the items are scored on a 1 - 5 scale and the maximum of many of the variables is 8. Occasionally, for perfectly good reasons other than to annoy you, people code data as 8, 99 or whatever to show that an item was "not administered", "not answered", "did not know" and so on. When you look at the output obtained from PROC MEANS, you can see that is exactly what happened, and now you have a lot of data that are out of range, completely throwing off your average values.
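Before recoding anything, it can help to pin down how a single item is coded by running two one-way frequency tables of the same item, once with its format and once without. This is a quick check of my own, not part of the distributed TIMSS programs; the item name is one from the printout above:

PROC FREQ DATA = lib.g8_achieve07 ;
   TABLES M042079 ;                /* formatted values, e.g. C*, D, NOT ADMIN. */
RUN ;
PROC FREQ DATA = lib.g8_achieve07 ;
   TABLES M042079 ;
   FORMAT M042079 ;                /* same item unformatted, e.g. 1, 2, 3, 8   */
RUN ;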

There are some interesting analyses that can be done of patterns of missing data, but now is not the time to do them. If you want to use those missing value codes later, good for you, but given how many times these have caused completely incorrect results, you probably want to set these out-of-range values to missing in your analytic file. If you do need an analysis on non-respondents by type, you can always go back to the raw data file and read it in with the missing codes.

PROC FREQ DATA = lib.g8_achieve07 NOPRINT ;
   TABLES idvar / OUT = achievefreq (WHERE = ( COUNT > 1 )) ;
   FORMAT idvar ;

This is the step where you identify duplicate ID values. In almost every circumstance, you are going to have a unique identifier that is supposed to be, well, unique. This can be a social security number, transaction ID, employee number, whatever. Be sure you don't forget the NOPRINT option!!! If you happen to leave off the WHERE clause in the next statement or make some other error, the last thing you want is a frequency table of 2,000,000 user IDs printed out. The frequencies are going to be output to a file named achievefreq. Only values where the count is greater than 1 will be written to that data set. Also, note that there are two sets of parentheses in the WHERE clause. I want to force it to use the unformatted value, so I included a FORMAT statement with the variable name, followed by no format whatsoever. The next step prints the first ten duplicate ID numbers. I used the OBS = 10 option because, just in case I used the wrong variable, forgot the COUNT > 1, accidentally typed COUNT = 1, or for one of a bunch of other reasons ended up with 2,000,000 records in this data set, I don't want them all printing out.

PROC PRINT DATA = achievefreq (OBS = 10) ;

Your mileage may vary: Using SAS Enterprise Guide for Data Triage

Here is another example of making a different programming choice. SAS Enterprise Guide provides a second option for a first look at your data. Go to the TASKS menu, select DESCRIBE and then CHARACTERIZE DATA.


A window will pop up that shows the data set to be used. (If your data set is not shown, click on the ADD button in that window to add it.) Click NEXT. The next window to pop up gives you report options. The button next to Summary Report is clicked by default. Leave it that way. Click next to the Graphs button to de-select it. With hundreds of variables, creating the graphs takes a long time and doesn't give that much additional information over the results from the frequencies and descriptive statistics. By default, SAS saves the output of the descriptive and frequency distributions as data sets. The frequency data set gives you the frequencies for the categorical variables only. Of course, you only get descriptive statistics for the numeric variables, because you cannot get a mean for "ethnicity" or a maximum for "city".

You may want to save these data sets if you plan on doing analysis with them later. At this point, I don’t have any such plans, so I de-select the button next to SAS Data Sets. Because most of the government datasets you can download will have hundreds of variables, many with enormous ranges, graphs are not generally that useful in my experience. When I ran this analysis, it was on a computer without a lot of RAM and SAS Enterprise Guide runs better with more memory. So, I deselected the graphs. If you only have a few variables, or if you have 12 GB of RAM and don’t mind scrolling through hundreds of histograms, you may make a different choice.


FOUR STEPS FOR FINDING AND FIXING DATA PROBLEMS

Step 1: Use PROC SUMMARY, which is the exact same procedure as PROC MEANS but does not produce printed output by default. The step below will compute the statistics you specify (mean, min, n, std) for all numeric variables in the data set and output these statistics to a data set named achieve_stats.

PROC SUMMARY DATA = lib.g8_achieve07 MEAN MIN N STD ;
   OUTPUT OUT = achieve_stats ;
   VAR _NUMERIC_ ;

The SUMMARY procedure produces a "wide" data set, with a variable named _STAT_ that identifies the statistic, one column for each variable in the original data set, and a few rows holding the mean, N and so on.

Step 2: Transpose the data. I want the data set transposed so I have a few variables and hundreds of observations.

PROC TRANSPOSE DATA = achieve_stats OUT = achieve_stats_trans ;
   ID _STAT_ ;

The ID variable will provide the names of the variables in the transposed data set.
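To picture the two data sets (a schematic of the structure only, not actual TIMSS output): before transposing, achieve_stats has one row per statistic and one column per test item; after transposing, achieve_stats_trans has one row per test item and one column per statistic.

achieve_stats (wide):        _TYPE_  _FREQ_  _STAT_  M022043  M022046  ...  S042164
achieve_stats_trans (long):  _NAME_  N  MIN  MAX  MEAN  STD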

Step 3: Write a DATA step that reads in the results from the PROC TRANSPOSE and applies rules for selecting problem variables. These rules could include having more than a set percentage of missing values, having a negative minimum value, or having a standard deviation of zero.

You want to at least look at any variable that has more than 5-10% missing values. Remember, stare at your data. The denominator is the total N for your sample, a number you should have known ever since you did the PROC CONTENTS, right? Some variables may have missing data for a very good reason. For example, father’s education may be unknown if the children don’t live with their father. It can also reflect a poorly-worded question. Students who live with a stepfather may not know if the question means their stepfather or biological father. So they leave the item blank. This is where you check the codebook and see what valid answers are.

Since the data set I am using is of answers to test questions, any negative values are a sign that something is wrong. Even if your data includes variables with legitimate negative values, such as corporate net income, you probably still want to look at the negative values. When people don’t know an answer or want to be emphatic, they sometimes give a figure like -$999,999,999,999 . Unless your data includes the federal deficit, that’s probably not a legitimate answer.

The standard deviation (STD) is a measure of the average amount by which people vary from the mean. If your standard deviation is zero that means everyone gave the same answer. That’s suspicious, so you want to check those variables.

DATA achieve_chk ;
   SET achieve_stats_trans ;
   pctmiss = 1 - (N/7377) ;
   IF MIN < 0 THEN neg_min = 1 ;
      ELSE neg_min = 0 ;
   IF STD = 0 THEN constant = 1 ;
      ELSE constant = 0 ;
   IF (pctmiss > .05 OR neg_min = 1 OR constant = 1) THEN OUTPUT ;

PROC PRINT DATA = achieve_chk ;

Step 4: Fix missing values and score questions. The three previous steps would apply to almost any data. This next step is useful for challenges in data sets organized similarly to the example we're using. Here is our problem: we have data that are not coded in such a way that we can do useful statistical analysis. We can't get means, and we can't compare the percentage of students who answered one question correctly with the percentage correct on another question. Item responses are coded 8 for "item was not administered to student". For multiple choice items, the numbers 1, 2, 3 or 4 are used if the student answered A, B, C or D, and then a format is applied to show, for example, A* if the student answered A and it was the right answer. There are several different numbers that all have the format "PARTIAL RESPONSE", which means the student gets partial credit. There are several other variables where the value is 10, 11 or 19; these are items for which the student received full credit on a question where partial credit was not awarded, for example, "What is three cubed?" Then, there are more difficult items which, if answered correctly, should be awarded more than one point.

A very common fix for problems that involve recoding a large number of items is to use an ARRAY statement and a DO loop. That would work great if all you needed to do was change the items that had an "8" entered to missing, as sketched below. In this particular case, however, a simple DO loop won't work because the variables are not all going to be recoded the same way.
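For contrast, here is roughly what that simpler fix would look like if the "8" code were the only thing to clean up. This is a sketch for illustration only; the variable range is the one used later in this paper, and the recode below is deliberately too simple for the TIMSS items:

DATA math8_simple ;
   SET lib.g8_achieve07 ;
   ARRAY rec{*} M022043 -- S042164 ;
   DO i = 1 TO DIM(rec) ;
      IF rec{i} = 8 THEN rec{i} = . ;   /* set the "not administered" code to missing */
   END ;
RUN ;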

You could do one ARRAY and one DO loop for all of the multiple-choice variables, and another for all of the answers requiring computation. That's a lot of work given that there are over 500 variables, and you'd need to go through the codebook, figure out which are the multiple choice items and pull them out one by one. Another possibility you might consider is using the PUT function, which returns the formatted value. It's coded:

PUT(varname, fmt)

This would have worked if all of the variables used the same format but, unfortunately, they used a whole bunch of different formats. What you really need is something like a PUT function where you don't need to know the format name, something that would just return the formatted value, whatever the internally stored format happened to be. Enter the function that does exactly that, the VVALUE function. When SAS first added it, I am sure I thought, "Who imagined SAS needed any more functions? Certainly this is not something I would ever use." I was wrong. So, here is the solution:

DATA lib.math8 ;
   SET lib.G8_ACHIEVE07 ;
   ATTRIB fval LENGTH = $10. ;
   ARRAY rec{*} M022043 -- S042164 ;
   DO i = 1 TO DIM(rec) ;
      fval = VVALUE(rec{i}) ;
      IF fval = "NOT ADMIN." THEN rec{i} = . ;
      ELSE IF INDEX(fval,"*") > 0 OR fval = "PARTIAL RE" OR rec{i} IN (10,11,19) THEN rec{i} = 1 ;
      ELSE IF rec{i} > 19 AND rec{i} < 30 THEN rec{i} = 2 ;
      ELSE rec{i} = 0 ;
   END ;

Let's look at this line by line.

DATA lib.math8 ;
   SET lib.G8_ACHIEVE07 ;
   ATTRIB fval LENGTH = $10. ;
   ARRAY rec{*} M022043 -- S042164 ;

This first part is pretty obvious, creating a new data set. The reason for the ATTRIB statement is that some of the values of this new variable are going to be over the default length of 8 characters. You don't want the equality to fail because the value got truncated. In the ARRAY statement you're creating an array that has the items from M022043 to S042164 in it. You don't know exactly how many variables that is, so you use an * instead of a number for the dimension of the array, and the dimension will be however many variables there happen to be in that list.

DO i = 1 TO DIM(rec) ;

The DO statement will execute all of the statements between DO and END from when i = 1 up to the dimension of the array. The dimension is the number of variables in the array.

fval = VVALUE(rec{i}) ;

The VVALUE function returns the formatted value. This statement gives the variable fval the formatted value of the ith variable in the array. Creating the variable fval saves writing out something like IF INDEX(VVALUE(rec{i}),"*"), plus it's just more efficient programming because you're not computing that function every time you need the formatted value. The value of fval is written over each time you execute the DO loop. I need the formatted values to score the items, but I don't need to save them.

IF fval = "NOT ADMIN." THEN rec{i} = . ;

For all questions, if the item wasn't administered, you don't want a 998 or 8 or whatever you've got in there. You want it set to missing. So, if the formatted value is NOT ADMIN., the item is set to missing.

ELSE IF INDEX(fval,"*") > 0 OR fval = "PARTIAL RE" OR rec{i} IN (10,11,19) THEN rec{i} = 1 ;

The INDEX function uses the syntax INDEX(source, string). It searches the source for the substring specified. If the substring is found, the function returns the beginning position of the substring; otherwise, it returns a 0. In the statement above, if an asterisk, denoting a correct response, is found, then the item is scored as one. If the formatted value is "PARTIAL RE" (the first ten characters of PARTIAL RESPONSE) or if it is in the list of values that denote a correct response for a simple computation, it also gets scored one.

ELSE IF rec{i} > 19 AND rec{i} < 30 THEN rec{i} = 2 ;

If it is one of the values that denote a correct answer for one of the more difficult computation questions, it is scored a 2.

ELSE rec{i} = 0 ;
END ;

If it is anything else, it is a 0. With that, the loop is ended.

YOUR MILEAGE MAY VARY: OTHER CHOICES WHEN FIXING DATA PROBLEMS

There are a few different choices that could have been made:

If you wanted to keep the formatted values, you would have added an ARRAY statement creating new variables to hold the formatted values. In this case you would need to know the number of variables you were creating and give variable names for the new variables that were to hold the formatted values, for example:

ARRAY fmtval(202) fmtval1 - fmtval202 ;

Next, you would have written a statement to assign formatted values to each variable in the array. You would replace this:

fval = VVALUE(rec{i}) ;

with this:

fmtval{i} = VVALUE(rec{i}) ;

Danger, Will Robinson!

It’s really important to note that I changed the original values of those variables. Think before you do this. Read this next sentence carefully. Never, never, never, never delete data you can’t recover unless you are really sure that there is no probability greater than your death by drowning in caterpillars that you will never need that data again. And probably not even then. For each of those values, I have no way of knowing if the original value was a 19, 20, 21, or 22. All are now set equal to 2. However, if I really needed to know, I could go back to the data set I read in, the G8_ACHIEVE07 dataset. If all else fails, I can always go back to the TIMSS website and download it again. Another dangerous step I took willingly was this:

ELSE rec{i} = 0 ;

This gives the student a score of zero in every other possible circumstance. I meant to do that. I really did. Whether the student gave the wrong answer, skipped the question or ran out of time and didn't get that far in the test, it's 0 points. I knowingly made the assumption that students leave questions blank when they don't know the answer. However, let's say the question was income. You cannot assume that people who did not answer have 0 income. Always be careful of the unqualified ELSE statement.
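For a variable like income, a more cautious recode, inside whatever DATA step builds your analytic file, might look like the sketch below. The variable name and the 999999 "refused" code are hypothetical, just to show the pattern of qualifying every condition instead of letting an unqualified ELSE catch everything:

IF income = 999999 THEN income = . ;     /* hypothetical "refused" code set to missing   */
ELSE IF income < 0 THEN income = . ;     /* negative income treated as missing, not zero */
/* no unqualified ELSE: someone who skipped the question stays missing instead of becoming 0 */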


TURNING IT INTO A MACRO

Now that you have recoded all of the variables, it's time to run all of the data quality checks again. Since you have changed the value of God only knows how many variables and records, maybe you now have many more variables that meet that missing value cut-off. Maybe your changes didn't work. Perhaps there were other variables you needed to recode. Quit whining, you have to redo those steps! You might be thinking that perhaps there is an alternative to repeating the same code over and over. You are correct. Here we begin with your first macro.

OPTIONS MSTORED SASMSTORE = maclib MPRINT ;
LIBNAME maclib "C:\Users\AnnMaria\Documents\My SAS Files" ;
%MACRO strt(Projdir,dsn,fmts =) / STORE ;
   DM "CLEAR LOG; CLEAR OUT" ;
   OPTIONS NOFMTERR ;
   LIBNAME lib "C:\Users\AnnMaria\Documents\&projdir\sasdata" ;
   PROC MEANS DATA = lib.&dsn ;
   PROC CONTENTS DATA = lib.&dsn ;
   RUN ;
   %IF %LENGTH(&fmts) = 0 %THEN %GOTO skip ;
   %INCLUDE "C:\Users\AnnMaria\Documents\&projdir\sascode\&fmts..sas" ;
   %skip:
   %PUT You defined the following macro variables ;
   %PUT _user_ ;
%MEND strt ;

The strt macro explained:

OPTIONS MSTORED SASMSTORE = maclib MPRINT ;

If you want to store your macro, you need to use two options in the OPTIONS statement: MSTORED, to say that you want to store the compiled macros, and SASMSTORE =, to specify the location where the macros should be stored. Why you need two options, I don't know. I do know that you have to have both of them, because I have tried leaving one out and my program didn't run and gave me an error. The fact that I have time to do this makes me seriously think I need to take up some kind of hobby.

LIBNAME maclib "C:\Users\MyDir\Documents\My SAS Files" ;

This statement, obviously, identifies the directory where the stored macros will be, uh, stored.

%MACRO strt(Projdir,dsn,fmts =) / STORE ;

This statement gives the macro name, specifies the parameters that may or may not be supplied by the user, and adds the option that this macro should be stored. Without the /STORE option your macro is in working memory, just like a data set in the WORK library. Once your SAS session is ended, an unSTORED macro is no more. A word about macro parameters: I'd like to do as little unnecessary work as possible, which means I'd rather not write two macros, but there is a slight complication here. There will always be a data set and a directory in which it is located, but not necessarily always a file of formats. Macro parameters in the form parmname= are optional (keyword parameters). Positional parameters are those not followed by an equal sign, and they are not optional.

If you have both optional parameters and positional parameters in your macro, the positional parameters must come first.
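As a bare-bones illustration of that rule (this little demo macro is mine, not part of the TIMSS code):

%MACRO demo(dsn, idvar, fmts =, obsnum = 10) ;  /* two positional parameters, then two keyword parameters */
   %PUT dsn=&dsn idvar=&idvar fmts=&fmts obsnum=&obsnum ;
%MEND demo ;

%demo(math8, idstud)                         /* keyword parameters can be left out entirely     */
%demo(math8, idstud, fmts = studentformats)  /* or supplied by name, after the positional ones  */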

So there are two positional parameters and one keyword (optional) parameter.

DM "CLEAR LOG; CLEAR OUT" ;
OPTIONS NOFMTERR ;

The first two statements in the macro clear the log and output and set the system option NOFMTERR. Because this macro is what I do at the start (hence the clever name, strt) when using a new data set, I want to clear out everything and make sure whatever log or output I have showing on the screen came from this data set and nothing else.


LIBNAME lib "C:\Users\AnnMaria\Documents\&projdir\sasdata”; PROC MEANS DATA = lib.&dsn ; PROC CONTENTS DATA = lib.&dsn ; RUN; The LIBNAME statement is set that way for a reason. Whenever I start a new project, I create a directory with that project name. I also create three to five subdirectories. One has the sas code and is named sascode. One has the SAS data sets and is named sasdata. One has the output. One has the raw data sets (if they sent it as raw data). One has codebooks, if there were any. If a client comes back to me five years later, all I need to know is the name of the project and I can find just about anything. The PROC MEANS and PROC CONTENTS again are pretty obvious. The &dsn refers to whatever value is specified for the variable dsn when the macro is called. It will plug that value into the PROC MEANS statement and PROC CONTENTS statement. I do need that RUN statement in there, though. Without it, my PROC CONTENTS won’t execute. %IF %LENGTH(&fmts) = 0 %THEN %GOTO skip ; If there was no parameter for fmts, that is, if the length = 0, then go to the label skip. Notice two things here

1. The LENGTH function here is a macro function, so you need %LENGTH.

2. In the %GOTO statement, skip does not have a %. If I included that, the program would look for a macro named %skip.

%INCLUDE "C:\Users\AnnMaria\Documents\&projdir\sascode\&fmts..sas" ;

You need two dots in the file name &fmts..sas because the first one will just be interpreted as the terminator of the macro variable reference, not as part of the file name. Without the two periods, SAS would either run the resolved value straight into sas or go looking for a macro variable named fmtssas, not find what it needs, and give you an error.

%skip:

Since skip is a label, it needs a colon after it. I always feel weird not ending something in SAS with a semi-colon, but there it is. All %skip: does is label the location in the program to go to.

%PUT You defined the following macro variables ;
%PUT _user_ ;

These two statements just write to the SAS log, first the text "You defined the following macro variables" and then, following that, all of the user-defined macro variables.

%MEND strt ;
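Going back to the double-period trick for a moment, a quick way to see the difference for yourself is a throwaway test like this (not part of the downloaded program):

%LET fmts = studentformats ;
%PUT &fmts..sas ;   /* writes studentformats.sas to the log                    */
%PUT &fmts.sas ;    /* writes studentformatssas - the single period disappears */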

LIBNAME maclib "C:\Users\AnnMaria\Documents\My SAS files" ; OPTIONS MSTORED SASMSTORE = maclib ; %strt(TIMSS,G8_STUDENT07,fmts = studentformats) ;

So, my output is the results of the MEANS and CONTENTS procedures, and I have this in my SAS log:

You defined the following macro variables

STRT FMTS studentformats
STRT DSN G8_STUDENT07
STRT PROJDIR TIMSS

Or, if I didn't have a formats file, I would just leave that parameter out when I call the macro:

%strt(TIMSS,G8_STUDENT07) ;

and I'd have this in my SAS log:

You defined the following macro variables
STRT DSN G8_STUDENT07
STRT PROJDIR TIMSS


YOUR SECOND MACRO

Now that you have macros down, it's time for your second macro. The LIBNAME statement is not included here because it was in the macro above. There is no need to define it again if you are using both in the same interactive SAS session. Alternatively, if you had run the strt macro at a different time or (gasp!) not at all, you could just have the LIBNAME statement as the first statement in your program, followed by the statement that calls the data quality macro.

Here is the macro to do the data quality checks.

%MACRO dataqual(dsn,idvar,obsnum) / STORE ;

There are three macro parameters and all of them are required, so I made them all positional parameters.

TITLE "Duplicate ID Numbers" ;
PROC FREQ DATA = lib.&dsn NOPRINT ;
   TABLES &idvar / OUT = &dsn._freq (WHERE = ( COUNT > 1 )) ;
   FORMAT &idvar ;

This is the check for duplicate ID values, using the ID variable that will be specified when the macro is called. It will create an output data set named the same as the input data set with _freq appended to the name. Note that you need the . before the _freq or SAS will look for a macro variable named dsn_freq and won't find it. The "." at the end of the macro parameter reference denotes the end of the macro parameter name.

PROC PRINT DATA = &dsn._freq (OBS = 10 ) ;

This prints out the first ten duplicate ID values.

PROC SUMMARY DATA = lib.&dsn MEAN MIN N STD ;
   OUTPUT OUT = &dsn._stats ;
   VAR _NUMERIC_ ;

This creates the output dataset of means, minimum, number of observations and standard deviation for all of the numeric variables in the data set. The variables are output to a data set named the same as the input data set with _stats appended.

PROC TRANSPOSE DATA = &dsn._stats OUT = &dsn._stats_trans ;
   ID _STAT_ ;

This transposes the data set created by PROC SUMMARY and outputs the transpose to a data set with _trans appended to the name.

DATA &dsn._chk ;
   SET &dsn._stats_trans ;

This creates a new data set with the same name as your input data set with _chk appended. The transposed data set is read in.

pctmiss = 1 - (N/&obsnum) ;

The value of pctmiss is 1 minus the number of observations for that variable divided by the total number of observations in the data set, which is provided by the &obsnum parameter.

IF MIN < 0 THEN neg_min = 1 ;
   ELSE neg_min = 0 ;
IF STD = 0 THEN constant = 1 ;
   ELSE constant = 0 ;
IF (pctmiss > .05 OR neg_min = 1 OR constant = 1) THEN OUTPUT ;

These are the exact same statements as in the previous, non-macro section on data quality, selecting out the variables with negative minimum, constant values and more than 5% of the data missing.

TITLE "Deviant variables to check" ;
PROC PRINT DATA = &dsn._chk ;
RUN ;

This prints the deviant variables to check. By this point, if everything went well in the previous step, you’ll find a huge number, because there are many items that are not administered.


TITLE "First 10 observations with ALL of the variables unformatted" ;
PROC PRINT DATA = lib.&dsn (OBS = 10) ;
   FORMAT _ALL_ ;
RUN ;

This does the same step you did earlier with printing out the unformatted values of the first ten records. It’s the same statements with the only change being putting &dsn in place of the actual data set name.

TITLE "First 10 observations with ALL of the variables formatted" ;
PROC PRINT DATA = lib.&dsn (OBS = 10) ;
RUN ;

This does the same step you did earlier with printing out the formatted values of the first ten records. Again, it’s the same statements with the only change being putting &dsn in place of the actual data set name.

TITLE "Deviant variables to drop" ;
PROC PRINT DATA = &dsn._chk NOOBS ;
   VAR _NAME_ ;

Why the last PROC PRINT? Seriously, you don't think I'm going to type 300+ variable names, do you? I'm a terrible typist. Besides, that's too much effort. After I look through the results above, I will usually decide to delete most of the variables. There are usually a few I want to keep. For example, father's education might be missing for more than 5% of the sample, but that's because some people don't live with their fathers and may not have the data, so I might keep that variable despite missing data. Besides, 5% is a low threshold, so I may keep variables that are missing 10% of the data. For some variables, like income, a negative number can be correct. After examining the results, I can copy and paste the whole list of variables into a DROP statement and just delete the few I want to keep. The NOOBS option prevents the observation number from printing next to each value, so I don't need to do any editing after I copy and paste it. A partial listing of the output is shown below.

_NAME_
IDCNTRY
M042273
S022106
S042238C
S042311
S042401
ITLANG

%PUT You defined the following macro variables ;
%PUT _USER_ ;
RUN ;
%MEND dataqual ;

This is exactly like our previous macro (it's starting to get familiar now, isn't it?). This will put the values of the user-defined macro variables to our SAS log, which helps with debugging. It is not required to include these statements, and if you never have any errors in your programs, you won't need them. The last statement ends the macro.

To call the macro, I do this:

%dataqual(math8,idstud,7377)
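After running the macro and reviewing the output, the copy-and-paste into a DROP statement described above might end up looking something like this. The item names are the ones from the partial listing shown earlier; the trimmed data set name is my own invention, not from the TIMSS programs:

DATA lib.math8_trimmed ;
   SET lib.math8 ;
   /* names pasted from the "Deviant variables to drop" listing, minus the ones I decided to keep */
   DROP M042273 S022106 S042238C S042311 S042401 ;
RUN ;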

NEXT STEPS: MERGING THE FILES

Remember there were six files downloaded? That means I am going to use these macros that I have now created multiple times. With each file, I will do all of these data quality checks, clean up my data and finally have six lovely files. Now what? Logically, I will want to merge some of them together. The TIMSS data are fairly small files, with only 7,377 records and 500 or so variables in each file. Still, just for practice, maybe I want to throw in a PROC SQL to make the merging faster. In this case, I want to merge my high-quality achievement data set, with all the variables scored nicely, with my data set on student interests. You could, of course, do a PROC SORT on both data sets and then use a DATA step with MERGE and BY statements to merge the two (a sketch of that approach appears after the PROC SQL explanation below). On my desktop, that took .96 seconds. I have the patience of a guppy, though, so I want to do PROC SQL:


PROC SQL ;
   CREATE TABLE mrgdat AS
   SELECT *
      FROM lib.math8 s , lib.student_int d
      WHERE s.idstud = d.idstud ;

There are only two statements here. The first begins the SQL procedure. The second creates a temporary SAS data set mrgdat as a combination of the variables selected from the two data sets listed. The "*" selects all of the variables from the two data sets. The s after the first data set is the identifier (alias) for that data set. Similarly, the d will be the reference for the lib.student_int data set. Notice the "," in between the two data sets. The WHERE clause specifies that records will only be selected where the value of idstud for the first data set equals the value of idstud for the second data set. Note that, unlike in other SAS procedures, the WHERE is not a separate statement. There is no semi-colon from CREATE until the end.

It’s been said that people either love PROC SQL or they hate it. If there is a forced choice, I am definitely in the latter category. They’re not fooling anybody, SQL really isn’t SAS, it’s a whole new language. Still, there are times when it is useful. Notice that no SORT is required before the merge. In this particular case it does not make much difference, but some open data sets are millions of records and the speed advantage is noticeable.
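For comparison, the traditional sort-and-merge approach mentioned above would look roughly like the sketch below. The only data set names assumed are the ones already used in this paper, plus mrgdat2 for the merged result:

PROC SORT DATA = lib.math8 OUT = math8 ;
   BY idstud ;
PROC SORT DATA = lib.student_int OUT = student_int ;
   BY idstud ;
DATA mrgdat2 ;
   MERGE math8 student_int ;
   BY idstud ;
RUN ;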

CONCLUSION

What did we learn? One thing you hopefully learned was some ideas on how to organize a large project, including procedures for data cleaning, setting up a project directory with code and data within subdirectories, writing macros and faster merging for large data sets. Designing and implementing your own research project is valuable experience for a junior programmer because, to be quite frank, you're not going to get the opportunity to practice on a major project that the university grant funding or corporate marketing is dependent upon.

What about all that do-gooding for your community? All we've done so far is read in the data, fix the data, write a macro to examine and fix the data, and then merge data sets together. Exactly! There are two other SAS Essentials sections that discuss graphs, tables and statistics with open data, including presentations for urban middle school math and social studies programs (De Mars, 2011a; De Mars, 2011b).

The first thing you learn is that any real data analysis project is 80% or more getting the data into shape. Statistics textbooks and class examples generally provide students with small, tidy data sets ready for use. That is poor preparation for life as a programmer. If you learn nothing else from this paper, remember this: always, always check your data quality very carefully before you do anything else. This is not the cool, fun part where you create analyses to impress your friends or beautiful graphics to put up on your website. This is the part where you don't screw up by using the wrong data or interpreting the data in the wrong way. It's a thankless task and often no one will even notice that you did it. Odds are great, though, that it will come back to haunt you if you don't.

The TIMSS is a great example for this because, for any given student, most of the items are not administered. There is a very good reason for this. Very briefly, you cannot ask eighth-graders to take a test of several hundred items, but you also can't fairly test how well a country is doing relative to other nations based on a 25-question test. The solution, then, is to select a large number of students and give each of them 50 of the questions. Not knowing that over 80% of the values of every test item are an "8" on a 1 to 5 scale, and not correcting for that before you analyze the data, will cause your results to be wildly wrong.

REFERENCES

De Mars, A. (2011a). SAS Essentials II: Better-looking SAS for a better community. Paper presented at the annual meeting of the Western Users of SAS software.

De Mars, A. (2011b). Statistics for hamsters. Paper presented at the annual meeting of the Western Users of SAS software.

ACKNOWLEDGMENTS

Thank you to Dr. Peter Flom for reviewing an earlier draft of this paper.


CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

AnnMaria De Mars
The Julia Group
2111 7th St. #8
Santa Monica, CA 90405
(310) 717-9089
[email protected]
http://www.thejuliagroup.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.