Overhead 8

7/30/2019 Overhead 8


STA4133/5133.01


Data Sets: Subsetting, Combining and Updating

Subsetting Datasets

Using the SET statement together with a subsetting IF statement, you can easily create a subset of an existing SAS dataset.

Consider the following example.

An amusement park is collecting data about its train ride. The data below include the time of day, the number of cars on the train, and the number of people on the train. Suppose that at some point there is to be an analysis of the afternoon rides only. Create a subset of the trains dataset which includes only the afternoon train rides.

DATA trains;

INPUT Time TIME5. Cars People;

DATALINES;

10:10 6 21

12:15 10 56

15:30 10 25

11:30 8 34

13:15 8 12

10:45 6 13

20:30 6 32

23:15 6 12

;

RUN;

DATA afternoon;

SET trains;

IF time > '12:00'T; /*keep only obs with times after noon*/

RUN;

PROC PRINT;

FORMAT time TIME5.;

RUN;

Notice that the statement

IF time > '12:00'T;

is essentially the same as writing

IF time > '12:00'T THEN OUTPUT;

or, equivalently,

IF time > '12:00'T THEN OUTPUT afternoon;
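The explicit OUTPUT form becomes useful when one DATA step should write to more than one dataset. The following is a minimal sketch, assuming the trains dataset created above; the dataset name morning is introduced here only for illustration and does not appear in the original example.

```sas
/* Sketch: split trains into two subsets in a single pass.         */
/* "morning" is a hypothetical dataset name used for illustration. */
DATA afternoon morning;
   SET trains;
   IF time > '12:00'T THEN OUTPUT afternoon;  /* rides after noon        */
   ELSE OUTPUT morning;                       /* rides at or before noon */
RUN;
```

Because both datasets are named in the DATA statement, each OUTPUT statement has to say which dataset receives the current observation.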



Combining Datasets

You may find that the variables needed for your analysis come from separate datasets. It may also be the case that your complete set of records comes from separate files. In either case, you will need to combine SAS datasets, either by stacking, interleaving or merging them.

For example, longitudinal studies are usually stored in annual files. In order to perform cross-year studies you will want to create a single dataset which contains only the variables you need for your analysis from all years of the study. This may require stacking or interleaving existing datasets, while renaming and dropping variables.

Using the SET statement

Stacking SAS datasets

Any number of SAS datasets can be stacked, or concatenated, using the SET statement. The syntax is the same as for a single dataset, except that you specify two or more datasets in the SET statement:

DATA new;

SET dataset1 dataset2 ... datasetn;

Some important notes:

The input datasets do not have to be sorted, and in most cases the output dataset will not be sorted.

The total number of records resulting from the SET statement will be the sum of records in each input dataset.

The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.

Example. Consider the datasets master and test. The master dataset contains social security number and name, and the test dataset contains social security number and a test score. What is a reasonable way to combine these datasets? It would make sense to match score to name by social security number. We will do this later. For now, let us see what happens if we use the SET statement, so that the datasets are simply concatenated.

***Programs to create data sets MASTER and TEST;

DATA MASTER;

INPUT SS NAME : $9.;

DATALINES;

123456789 CODY

987654321 SMITH

111223333 GREGORY

222334444 HAMER



777665555 CHAMBLISS

;

DATA TEST;

INPUT SS SCORE;

DATALINES;

123456789 100

987654321 67

222334444 92

;

DATA combine;

SET master test;

run;

Examine the resulting dataset using PROC PRINT. Now switch the order of the input datasets in the SET statement and compare the outputs.
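The comparison suggested above can be sketched as follows, assuming the master, test and combine datasets from the preceding program. The name combine2 is a hypothetical name for the switched-order result.

```sas
/* Sketch: print the stacked dataset for both input orders.       */
/* "combine2" is a hypothetical name introduced for illustration. */
PROC PRINT DATA=combine;
   TITLE 'SET master test';
RUN;

DATA combine2;
   SET test master;   /* same datasets, opposite order */
RUN;

PROC PRINT DATA=combine2;
   TITLE 'SET test master';
RUN;
```

Both results have 5 + 3 = 8 observations. SCORE is missing on the records that came from master, and NAME is blank on the records that came from test. Only the order of the records (and of the variables NAME and SCORE) differs between the two outputs.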

Interleaving SAS datasets

When datasets are concatenated as in the above example, records will be ordered simply by how they arrive. The records from the first listed dataset in the SET statement will be first, and they will be ordered according to the input dataset. Records from the dataset listed last in the SET statement will be last, and ordered according to the input dataset. In most cases this results in an unsorted dataset, even if the input datasets are ordered.

When two input datasets are ordered by some variable and you wish the combined dataset to be ordered by that same variable, it is more efficient to interleave the datasets (than to stack them and then sort them).

The syntax for interleaving datasets is the same as that for stacking, with the addition of a BY statement:

DATA new;

SET dataset1 dataset2 ... datasetn;

BY variable1 ... variablen;

Some important notes:

The input datasets must be sorted by the variables in the BY statement above.

The output dataset will be sorted according to the BY statement.

The total number of records resulting from the SET statement will be the sum of records in each input dataset.

The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.



Example. Interleave the datasets master and test. Since these datasets have only one variable in common (social security number), the interleaving will be by that variable. The datasets are not sorted by this variable, so this should be done before the interleaving.

PROC SORT DATA=master;

BY ss;

PROC SORT DATA=test;

BY ss;

DATA combine;

SET master test;

BY ss;

run;

Examine the resulting dataset using PROC PRINT.

Using the MERGE statement

Combining datasets may involve matching on a particular variable. When this is the case, the MERGE statement should be used instead of the SET statement for specifying input datasets.

The general format of a MERGE statement is:

DATA new;

MERGE dataset1 dataset2;

BY variable1 variable2 ... variablen;

RUN;

Important

If the two input datasets have variables with the same name (other than the BY variables), the values of the variables from the second dataset will overwrite the values of the variables from the first dataset. Use the RENAME= option to assign new names to the variables in one of the datasets.

The two input datasets MUST be sorted in order of the BY variables, and the BY variables must have identical names in both datasets. If they do not have the same name, use the RENAME= option.
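As a hedged illustration of the first point, the sketch below merges two small datasets that share a non-BY variable. The datasets visit1 and visit2 and the variables id, weight, weight1 and weight2 are all hypothetical; they are not part of the examples in this handout.

```sas
/* Sketch with hypothetical data: both inputs contain "weight". */
DATA visit1;
   INPUT id weight;
   DATALINES;
1 150
2 180
;
RUN;

DATA visit2;
   INPUT id weight;
   DATALINES;
1 155
2 176
;
RUN;

/* Rename on input so that neither weight overwrites the other. */
/* Both inputs are already in order of id.                      */
DATA visits;
   MERGE visit1 (RENAME = (weight = weight1))
         visit2 (RENAME = (weight = weight2));
   BY id;
RUN;

PROC PRINT DATA=visits;
RUN;
```

Without the RENAME= options, the merged dataset would have a single weight variable holding the values from visit2.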

There are a variety of ways to merge two datasets using the MERGE statement:

1. One-to-one match merge, with a BY value
2. Non-matches with a BY value
3. Limiting observations by using the IN= option
4. One-to-many match merge

One-to-one match merge with a BY value



The following example merges a file containing demographic data and a file of health statistics into a single file called pophlth. In this case both datasets have exactly the same zip codes but different variables.

The file of demographic data has the following observations:

zip    popsize  hhsize
78201  111111   121212
78202  222222   343434
78203  333333   565656

The health file has the following observations:

zip    diab  access
78201  .04   .8
78202  .05   .7
78203  .06   .6

The files will be merged together by zip code, and will contain ALL variables in both input files.

/*Always remember to sort the input datasets*/

PROC SORT DATA=demog;

BY zip;

PROC SORT DATA=health;

BY zip;

DATA pophlth;

MERGE demog /* demography file */

health; /* health stat. file */

BY zip;

RUN;

The new dataset pophlth will have the following observations:

zip    popsize  hhsize  diab  access
78201  111111   121212  .04   .8
78202  222222   343434  .05   .7
78203  333333   565656  .06   .6

Note: Suppose the variable for zip code is named differently in the two input files. If the health dataset uses the name zcode, for example, you can use the RENAME= option to change it:

MERGE demog

health (RENAME = (zcode = zip));

Non-matches with a BY value



Suppose in the previous example that the health dataset has a zip code, say 78204, that is not in the dataset demog. The MERGE statement will keep all values for the health variables and set the demographic variables to missing.

As an extreme case, consider the case where no zip codes match. Consider a second health dataset with the following observations:

zip    diab  access
78214  .04   .7
78215  .03   .8
78216  .02   .9

The same code used previously will generate a new dataset pophlth with the following

observations:

zip    popsize  hhsize  diab  access
78201  111111   121212  .     .
78202  222222   343434  .     .
78203  333333   565656  .     .
78214  .        .       .04   .7
78215  .        .       .03   .8
78216  .        .       .02   .9

Limiting observations by using the IN= option

In the previous examples, all of the input records from both files belonged in the new output dataset. With the IN= option, you can specify which observations you wish to keep.

Suppose the demog dataset that we used earlier is a subset of a much larger dataset which contains demographic data for all US zip codes. Let's call this dataset Usdemog. We can merge this dataset with the health dataset by zip code, but we may not wish to keep all the observations with zip codes for which there is no health data. Consider the following code:

DATA pophlth;

MERGE demog (IN=indem)

health;

BY zip;

IF indem THEN OUTPUT;

RUN;

The IN= option creates the temporary variable indem, which is set to true only for observations to which the demog file contributed. The output file will contain an observation for each of the three zip codes in the demog dataset, along with the associated variables from the health file. If there was no matching zip code in the health file, then the health variables will be set to missing.

If you wanted to output records only if there was a demography AND a health record for the same zip code, you would use the following:

DATA pophlth;

MERGE demog (IN=indem)

health (IN=inhlth);

BY zip;

IF indem AND inhlth THEN OUTPUT;

RUN;

You can take advantage of the IN= option to create several output datasets in one data step. Suppose you are merging the same files (Usdemog and health) and you wish to keep a dataset of records that matched as well as a dataset of records that did not match. The code can easily be modified to accomplish this:

DATA phmatch /*records that matched*/

unm_hlth /*health records with no match in demog*/

unm_dem; /*demog records with no match in health*/

MERGE demog (IN=indem)

health (IN=inhlth);

BY zip;

IF indem THEN DO;

IF inhlth THEN OUTPUT phmatch;

ELSE OUTPUT unm_dem;

END;

ELSE IF inhlth THEN OUTPUT unm_hlth;

RUN;

One-to-Many match merge

Sometimes you want to merge several observations in one data set with a single observation in another data set. For example, suppose you have data by zip code and you need to get state information for those zip codes. Since each state has multiple zip codes, you will need a one-to-many merge.

The difference between this type of merge and the one-to-one merge is not in the code, but in the data. For this reason, it is important to know your datasets before you merge them! If, in fact, your datasets are not in a one-to-many format, but in a many-to-many format, your new dataset may not be what you expect it to be!

Consider the following demographic dataset, which has a state identifier:

zip    zipsize  state
27703  111111   12
78202  222222   28
78203  333333   28

Consider the following dataset with access to care information, by state:

state  access
12     0.7
23     0.8
28     0.6

The following code will perform the one-to-many merge:

DATA new;

MERGE demog access;

BY state;

RUN;

The new dataset will contain the following observations:

zip    zipsize  state  access
27703  111111   12     0.7
.      .        23     0.8
78202  222222   28     0.6
78203  333333   28     0.6
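One way to check that a merge like the one above really is one-to-many is to confirm that the BY variable is unique in the dataset that is supposed to be on the "one" side. The following is a minimal sketch, assuming the access dataset shown above; the dataset name dupstates is hypothetical and is introduced only for illustration.

```sas
/* Sketch: list any state that appears more than once in access.   */
/* "dupstates" is a hypothetical name introduced for illustration. */
PROC SORT DATA=access;
   BY state;
RUN;

DATA dupstates;
   SET access;
   BY state;
   /* Keep a record only if it is not the sole record for its state */
   IF NOT (FIRST.state AND LAST.state);
RUN;

PROC PRINT DATA=dupstates;
RUN;
```

If dupstates has no observations, every state occurs exactly once in access, so merging it with demog by state is a one-to-many merge rather than a many-to-many merge.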

Updating Datasets

You may have a master dataset that needs to be updated periodically. Study data are an example of data that need to be updated with corrections, follow-up data, or other new information.

The general form of the UPDATE statement is the same as for the MERGE statement, except that it takes exactly two input datasets:

DATA master;

UPDATE master transactions;

BY variable1 ... variablen;

RUN;

Important notes:

Input datasets must be sorted according to the BY statement. The output dataset will be sorted.

The values of the BY variables must be unique in the master dataset (but not necessarily in the transaction/updates dataset).

Missing values in the transaction dataset DO NOT replace existing values in the master dataset.
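The last note is the main practical difference from MERGE. It can be sketched with two tiny datasets; the dataset names old, new, updated and merged and the variables id, score and grade are hypothetical and are used only for this illustration.

```sas
/* Sketch with hypothetical data: "old" plays the role of the master */
/* dataset and "new" plays the role of the transaction dataset.      */
DATA old;
   INPUT id score grade $;
   DATALINES;
1 80 B
2 95 A
;
RUN;

DATA new;
   INPUT id score grade $;
   DATALINES;
1 85 .
3 70 C
;
RUN;

/* UPDATE: the missing grade for id 1 does not replace the B */
DATA updated;
   UPDATE old new;
   BY id;
RUN;

/* MERGE: the missing grade for id 1 overwrites the B */
DATA merged;
   MERGE old new;
   BY id;
RUN;
```

In updated, id 1 has score 85 and grade B. In merged, id 1 has score 85 and a blank grade. In both datasets id 2 is unchanged and id 3 is added as a new record.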



Example. (B 6.8)

A hospital maintains a master database with patient information. Each record contains the patient's account number, last name, address, date of birth, sex, insurance code, and the date that the patient's information was last updated. Whenever a patient is admitted to the hospital, a transaction record is created, containing new information and status changes. Some of the patients are new (and are not yet in the master database).

The code below illustrates how the master dataset is updated:

DATA master;
INPUT Account LastName $ 8-16 Address $ 17-34
BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
DATALINES;
620135 Smith    234 Aspen St.     12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave.       04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan  76-C La Vista     .          . MCD 01-15-2003
;
RUN;

DATA transactions;
INPUT Account LastName $ 8-16 Address $ 17-34 BirthDate MMDDYY10.
Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
DATALINES;
620135 .        .                 .          . HLT 06-15-2003
874329 .        .                 04-24-1954 m .   06-15-2003
235777 Harman   5656 Land Way     01-18-2000 f MCD 06-15-2003
;
RUN;

* The data lines above are aligned to the columns named in the INPUT statements;
* The master dataset is already in Account order, so only transactions is sorted;
PROC SORT DATA = transactions;
BY Account;
RUN;

* Update patient data with transactions;
DATA master;
UPDATE master transactions;
BY Account;
RUN;

PROC PRINT DATA = master;
FORMAT BirthDate LastUpdate MMDDYY10.;
TITLE 'Admissions Data';
RUN;
