Overhead 8
7/30/2019 Overhead 8
STA4133/5133.01
Data Sets: Subsetting, Combining and Updating
Subsetting Datasets
Using the SET statement together with a subsetting IF statement, you can easily create a subset of an existing SAS dataset.
Consider the following example.
An amusement park is collecting data about its train ride. The data below include the time of day, the number of cars on the train, and the number of people on the train. Suppose at some point
there is to be an analysis of the afternoon rides only. Create a subset of the trains dataset which
includes only the afternoon train rides.
DATA trains;
INPUT Time TIME5. Cars People;
DATALINES;
10:10 6 21
12:15 10 56
15:30 10 25
11:30 8 34
13:15 8 12
10:45 6 13
20:30 6 32
23:15 6 12
;
RUN;
DATA afternoon;
SET trains;
IF time>'12:00'T; /*keep only obs with times after noon*/
RUN;
PROC PRINT;
FORMAT time TIME5.;
RUN;
Notice that the statement
IF time>'12:00'T;
is essentially the same as writing
IF time>'12:00'T THEN OUTPUT;
or, equivalently,
IF time>'12:00'T THEN OUTPUT afternoon;
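The explicit OUTPUT form becomes useful when one DATA step should write to more than one dataset. The following is a minimal sketch, assuming the trains dataset created above; the dataset name morning is hypothetical and is introduced only for illustration.

```sas
/* Sketch: split trains into two datasets in a single DATA step.  */
/* "morning" is a hypothetical name, not part of the overhead.    */
DATA morning afternoon;
   SET trains;
   IF time > '12:00'T THEN OUTPUT afternoon; /* rides after noon   */
   ELSE OUTPUT morning;                      /* noon or earlier    */
RUN;
```

Note that a ride at exactly 12:00, or one with a missing time, would fall into morning under this condition, since a missing numeric value compares as smaller than any number.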
Combining Datasets
You may find that the variables needed for your analysis come from separate datasets. It may also be the case that your complete set of records comes from separate files.
In either case, you will need to combine SAS datasets, either by stacking, interleaving or merging
them.
For example, longitudinal studies are usually stored in annual files. In order to perform cross year
studies you will want to create a single dataset which contains only the variables you need for your
analysis from all years of the study. This may require stacking or interleaving existing datasets, while renaming and dropping variables.
Using the SET statement
Stacking SAS datasets
Any number of SAS datasets can be stacked, or concatenated, using the SET statement. The syntax
is the same as for reading a single dataset, except that you list two or more datasets in the SET statement:
DATA new;
SET dataset1 dataset2 ... datasetn;
Some important notes:
The input datasets do not have to be sorted, and in most cases the output dataset will not be sorted.
The total number of records resulting from the SET statement will be the sum of records in each input dataset.
The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.
Example. Consider the datasets master and test. The master dataset contains social security number and name, and the test dataset contains social security number and a test score. What is a
reasonable way to combine these datasets? It would make sense to match score to name by social
security number. We will do this later. For now, let us see what happens if we use the SET statement, so that these datasets are simply concatenated.
***Programs to create data sets MASTER and TEST;
DATA MASTER;
INPUT SS NAME : $9.;
DATALINES;
123456789 CODY
987654321 SMITH
111223333 GREGORY
222334444 HAMER
777665555 CHAMBLISS
;
DATA TEST;
INPUT SS SCORE;
DATALINES;
123456789 100
987654321 67
222334444 92
;
DATA combine;
SET master test;
run;
Examine the output from this dataset using the PROC PRINT statement. Now switch the input
datasets and compare the outputs.
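One way to carry out this comparison is sketched below, assuming the master and test datasets created above. The dataset name combine2 and the titles are hypothetical, added only for illustration.

```sas
/* Sketch: print the stacked dataset, then stack in the other order. */
PROC PRINT DATA=combine;
   TITLE 'MASTER followed by TEST';
RUN;

DATA combine2;          /* hypothetical name for the reversed stack */
   SET test master;
RUN;

PROC PRINT DATA=combine2;
   TITLE 'TEST followed by MASTER';
RUN;
```

Both datasets should contain the same eight observations. What changes is the order of the observations and the order of the variables, since SAS adds variables to the output in the order in which it first encounters them.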
Interleaving SAS datasets
When datasets are concatenated as in the above example, records will be ordered simply by how
they arrive. The records from the first listed dataset in the SET statement will be first, and they will
be ordered according to the input dataset. Records from the dataset listed last in the SET statement will be last, and ordered according to the input dataset. In most cases this results in an unsorted
dataset, even if the input datasets are ordered.
When two input datasets are ordered by some variable and you wish the combined dataset to be
ordered by that same variable, it is more efficient to interleave the datasets (than to stack them and
then sort them).
The syntax for interleaving datasets is the same as that for stacking, with the addition of a BY
statement:
DATA new;
SET dataset1 dataset2 ... datasetn;
BY variable1 ... variablen;
Some important notes:
The input datasets must be sorted by the variables in the BY statement above.
The output dataset will be sorted according to the BY statement.
The total number of records resulting from the SET statement will be the sum of records in each input dataset.
The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.
Example. Interleave the datasets master and test. Since these datasets have only one variable in
common, social security number, the interleaving will be by that variable. The datasets are not sorted by this variable, so this should be done before the interleaving.
PROC SORT DATA=master;
BY ss;
PROC SORT DATA=test;
BY ss;
DATA combine;
SET master test;
BY ss;
run;
Examine the output from this dataset using the PROC PRINT statement.
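For comparison, the less efficient alternative mentioned earlier (stack first, then sort) can be sketched as follows, assuming the same master and test datasets. The dataset name stacked is hypothetical.

```sas
/* Sketch: stack the two datasets, then sort the result by ss.  */
/* "stacked" is a hypothetical name used only for illustration. */
DATA stacked;
   SET master test;
RUN;

PROC SORT DATA=stacked;
   BY ss;
RUN;

PROC PRINT DATA=stacked;
RUN;
```

With the default behavior of PROC SORT, which keeps observations with equal BY values in their original relative order, this should produce the same ordering as the interleaved dataset combine, at the cost of an extra pass over the combined data.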
Using the MERGE statement
Combining datasets may involve matching on a particular variable. When this is the case, the MERGE statement should be used instead of the SET statement to specify the input datasets.
The general format of a MERGE statement is:
DATA new;
MERGE dataset1 dataset2;
BY variable1 variable2 ... variablen;
RUN;
Important
If the two input datasets have variables with the same name (other than the BY variables), the value of the variables from the second dataset will overwrite the value of the variables from the first dataset. Use the RENAME= option to assign new names to the variables in one of the datasets.
The two input datasets MUST be sorted in order of the BY variables, and the BY variables must have identical names in both datasets. If they do not have the same name, use the RENAME= option.
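To illustrate the first note, here is a minimal sketch with two hypothetical datasets, survey1 and survey2, each containing a variable named score in addition to the BY variable id. All of these names are assumptions made for illustration and do not appear elsewhere in this overhead.

```sas
/* Sketch: keep both versions of a non-BY variable that appears */
/* in both inputs. survey1, survey2, id, score and score2 are   */
/* hypothetical names. Both inputs are assumed sorted by id.    */
DATA both;
   MERGE survey1
         survey2 (RENAME = (score = score2));
   BY id;
RUN;
```

Without the RENAME= option, the value of score from survey2 would replace the value from survey1 whenever both datasets contribute to the same observation.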
There are a variety of ways to merge two datasets using the MERGE statement:
1. One-to-one match merge, with a BY value
2. Non-matches with a BY value
3. Limiting observations by using the IN= option
4. One-to-many match merge
One-to-one match merge with a BY value
The following example merges a file containing demographic data and a file of health statistics into
a single file called pophlth. In this case both datasets have exactly the same zip codes but different variables.
The file of demographic data has the following observations:
zip popsize hhsize
78201 111111 121212
78202 222222 343434
78203 333333 565656
The health file has the following observations:
zip diab access
78201 .04 .8
78202 .05 .7
78203 .06 .6
The files will be merged together by zip code, and will contain ALL variables in both input files.
/*Always remember to sort the input datasets*/
PROC SORT DATA=demog;
BY zip;
PROC SORT DATA=health;
BY zip;
DATA pophlth;
MERGE demog /* demography file */
health; /* health stat. file */
BY zip;
RUN;
The new dataset pophlth will have the following observations:
zip popsize hhsize diab access
78201 111111 121212 .04 .8
78202 222222 343434 .05 .7
78203 333333 565656 .06 .6
Note: Suppose the variable for zip code is named differently for the two input files. If the health
dataset uses the name zcode, for example, you can use the RENAME= option to change it:
MERGE demog
health (RENAME = (zcode = zip));
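Put together as a complete step, the renamed merge might look like the sketch below. It assumes the health dataset stores the zip code under the name zcode, as in the note above, and that zip and zcode have the same type.

```sas
/* Sketch: merge when the BY variable has different names.       */
/* health must be sorted by its own variable name (zcode) first. */
PROC SORT DATA=demog;
   BY zip;
RUN;

PROC SORT DATA=health;
   BY zcode;
RUN;

DATA pophlth;
   MERGE demog
         health (RENAME = (zcode = zip)); /* renamed as it is read */
   BY zip;
RUN;
```

Because the rename is applied as the dataset is read, the BY statement refers to the new name, zip.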
Non-matches with a BY value
Suppose in the previous example that the health dataset has a zip code, say 78204, that is not in the dataset demog. The MERGE statement will keep all values for the health variables and set the
demographic variables to missing.
As an extreme case, consider the case where no zip codes match. Consider a second health dataset with the following observations:
zip diab access
78214 .04 .7
78215 .03 .8
78216 .02 .9
The same code used previously will generate a new dataset pophlth with the following
observations:
zip popsize hhsize diab access
78201 111111 121212 . .
78202 222222 343434 . .
78203 333333 565656 . .
78214 . . .04 .7
78215 . . .03 .8
78216 . . .02 .9
Limiting observations by using the IN= option
In the previous examples, all of the input records from both files belonged in the new output
dataset. With the IN= option, you can specify which observations you wish to keep.
Suppose the demog dataset that we used earlier is a subset of a much larger dataset which contains demographic data for all US zip codes. Let's call this dataset Usdemog. We can merge this dataset
with the health dataset by zip code, but we may not wish to keep all the observations with zip codes
for which there is no health data. Consider the following code:
DATA pophlth;
MERGE demog (IN=indem)
health;
BY zip;
IF indem THEN OUTPUT;
RUN;
The IN= option specifies the temporary variable indem, which is set to true only for observations
for which the demog file contributed to the merge. The output file will contain an observation for
the three zip codes in the demog dataset, along with the associated variables from the health file. If
there was not a matching zip code in the health file, then the health variables will be set to missing.
If you wanted to output records only if there was a demography AND a health record for the same
zip code you would use the following:
DATA pophlth;
MERGE demog (IN=indem)
health (IN=inhlth);
BY zip;
IF indem AND inhlth THEN OUTPUT;
RUN;
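A related variant, sketched here under the assumption that the larger dataset Usdemog exists and that both inputs are sorted by zip, keeps every zip code that has health data, whether or not demographic data are available for it. The flag is placed on the health dataset and tested with a subsetting IF.

```sas
/* Sketch: keep only zip codes that appear in the health file.   */
/* Usdemog is the larger demographic dataset described above;    */
/* both inputs are assumed to be sorted by zip already.          */
DATA pophlth;
   MERGE Usdemog (IN=indem)
         health  (IN=inhlth);
   BY zip;
   IF inhlth;   /* same effect as IF inhlth THEN OUTPUT; */
RUN;
```

Zip codes found only in Usdemog are dropped, while a health record with no demographic match is kept with missing demographic variables.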
You can take advantage of the IN= option to create several output datasets in one data step.
Suppose you are merging the same files (Usdemog and health) and you wish to keep a dataset of records that matched as well as a dataset of records that did not match. The code can easily be
modified to accomplish this:
DATA phmatch /*records that matched*/
unm_hlth /*health records with no match in demog*/
unm_dem; /*demog records with no match in health*/
MERGE demog (IN=indem)
health (IN=inhlth);
BY zip;
IF indem THEN DO;
IF inhlth THEN OUTPUT phmatch;
ELSE OUTPUT unm_dem;
END;
ELSE IF inhlth THEN OUTPUT unm_hlth;
RUN;
One-to-Many match merge
Sometimes you want to merge several observations in one data set with a single observation in
another data set. For example, suppose you have data by zip code and you need to get state information for those zip codes. Since each state has multiple zip codes, you will need a
one-to-many merge.
The difference between this type of merge and the one-to-one merge is not in the code, but in the
data. For this reason, it is important to know your datasets before you merge them! If, in fact, your datasets are not in a one-to-many format, but in a many-to-many format, your new dataset may not
be what you expect it to be!
Consider the following demographic dataset, which has a state identifier:
zip zipsize state
27703 111111 12
78202 222222 28
78203 333333 28
Consider the following dataset with access to care information, by state:
state access
12 0.7
23 0.8
28 0.6
The following code will perform the one-to-many merge:
DATA new;
MERGE demog access;
BY state;
RUN;
The new Data Set will contain the following observations:
zip zipsize state access
27703 111111 12 0.7
. . 23 0.8
78202 222222 28 0.6
78203 333333 28 0.6
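The row for state 23 appears because the access dataset contributes a state that has no zip codes in demog. If such rows are not wanted, the IN= option described earlier can be combined with the one-to-many merge. The sketch below assumes the two datasets are named demog and access as in the code above; the output name zipstate is hypothetical.

```sas
/* Sketch: one-to-many merge that keeps only states with zip     */
/* code records. "zipstate" is a hypothetical dataset name.      */
PROC SORT DATA=demog;
   BY state;
RUN;

PROC SORT DATA=access;
   BY state;
RUN;

DATA zipstate;
   MERGE demog (IN=indem)
         access;
   BY state;
   IF indem;   /* drop states that have no rows in demog */
RUN;
```

With the data shown above, this should keep the three zip code rows and omit the row for state 23.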
Updating Datasets
You may have a master dataset that needs to be updated periodically. Study data is an example of
data that need to be updated with corrections, follow-up data, or other new information.
The general form of the UPDATE statement is the same as for the MERGE statement, except it
only specifies two input datasets:
DATA master;
UPDATE master transactions;
BY variable1 ... variablen;
RUN;
Important notes:
Input datasets must be sorted according to the BY statement.
The output dataset will be sorted.
The values of the BY variables must be unique in the master dataset (but not necessarily in the transaction/updates dataset).
Missing values in the transaction dataset DO NOT replace existing values in the master dataset.
Example. (B 6.8)
A hospital maintains a master database with patient information. Each record contains the patient's
account number, last name, address, date of birth, sex, insurance code, and the date that patient's
information was last updated. Whenever a patient is admitted to the hospital, a transaction record is
created, containing new information and status changes. Some of the patients are new (and are not yet in the master database).
The code below illustrates how the master dataset is updated:
DATA master;
INPUT Account LastName $ 8-16 Address $ 17-34
BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate
MMDDYY10.;
DATALINES;
620135 Smith    234 Aspen St.     12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave.       04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan  76-C La Vista     .          . MCD 01-15-2003
;
RUN;
DATA transactions;
INPUT Account LastName $ 8-16 Address $ 17-34 BirthDate MMDDYY10.
Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
DATALINES;
620135 .        .                 .          . HLT 06-15-2003
874329 .        .                 04-24-1954 m .   06-15-2003
235777 Harman   5656 Land Way     01-18-2000 f MCD 06-15-2003
;
RUN;
PROC SORT DATA = transactions;
BY Account;
* Update patient data with transactions;
DATA master;
UPDATE master transactions;
BY Account;
RUN;
PROC PRINT DATA = master;
FORMAT BirthDate LastUpdate MMDDYY10.;
TITLE 'Admissions Data';
RUN;
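The example above overwrites master with the updated version. A more cautious variant, sketched below, writes the result to a separate dataset so that the original master remains available for comparison. The name master_new is hypothetical, and the explicit sort of master is included only as a safeguard, since the master records above are already in Account order.

```sas
/* Sketch: update into a new dataset instead of overwriting.     */
/* "master_new" is a hypothetical name used for illustration.    */
PROC SORT DATA=master;
   BY Account;
RUN;

PROC SORT DATA=transactions;
   BY Account;
RUN;

DATA master_new;
   UPDATE master transactions;
   BY Account;
RUN;

PROC PRINT DATA=master_new;
   FORMAT BirthDate LastUpdate MMDDYY10.;
   TITLE 'Admissions Data (updated copy)';
RUN;
```

Printing master and master_new side by side makes it easy to check which fields changed and to confirm that the missing values in the transactions did not erase existing information.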