Overhead 8

  • 7/30/2019 Overhead 8

    STA4133/5133.01

    Data Sets: Subsetting, Combining and Updating

    Subsetting Datasets

    Using the SET statement together with a subsetting IF statement, you can easily create a subset of an existing SAS dataset. Consider the following example.

    An amusement park is collecting data about its train ride. The data below include the time of day, the number of cars on the train, and the number of people on the train. Suppose that at some point there is to be an analysis of the afternoon rides only. Create a subset of the trains dataset that includes only the afternoon train rides.

    DATA trains;
    INPUT Time TIME5. Cars People;
    DATALINES;
    10:10 6 21
    12:15 10 56
    15:30 10 25
    11:30 8 34
    13:15 8 12
    10:45 6 13
    20:30 6 32
    23:15 6 12
    ;
    RUN;

    DATA afternoon;
    SET trains;
    IF time>'12:00'T; /*keep only obs with times after noon*/
    RUN;

    PROC PRINT DATA=afternoon;
    FORMAT time TIME5.;
    RUN;

    Notice that the statement

    IF time>'12:00'T;

    is essentially the same as writing

    IF time>'12:00'T THEN OUTPUT;

    or, equivalently,

    IF time>'12:00'T THEN OUTPUT afternoon;
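    Naming the target dataset in the OUTPUT statement also makes it possible to route observations to more than one dataset in a single pass. The following is a minimal sketch, assuming the trains dataset created above already exists; the dataset name morning is hypothetical and is not part of the original example.

    ```sas
    /* Sketch: split trains into two datasets in one DATA step.         */
    /* "morning" is a hypothetical dataset name used for illustration.  */
    DATA morning afternoon;
      SET trains;
      IF time > '12:00'T THEN OUTPUT afternoon; /* rides after noon        */
      ELSE OUTPUT morning;                      /* rides at or before noon */
    RUN;

    PROC PRINT DATA=morning;
      FORMAT time TIME5.;
    RUN;
    ```

    Because an explicit OUTPUT statement is present, SAS does not write an observation automatically at the end of each iteration, so every ride lands in exactly one of the two datasets. With the eight rides listed above, three go to morning and five go to afternoon.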

    Combining Datasets

    You may find that the variables needed for your analysis come from separate datasets. It may also be the case that your complete set of records comes from separate files. In either case, you will need to combine SAS datasets by stacking, interleaving, or merging them.

    For example, data from longitudinal studies are usually stored in annual files. To perform cross-year studies, you will want to create a single dataset that contains only the variables you need for your analysis from all years of the study. This may require stacking or interleaving existing datasets while renaming and dropping variables.

    Using the SET statement

    Stacking SAS datasets

    Any number of SAS datasets can be stacked, or concatenated, using the SET statement. The syntax is the same as for reading a single dataset, except that you specify two or more datasets in the SET statement:

    DATA new;
    SET dataset1 dataset2 ... datasetn;

    Some important notes:

    The input datasets do not have to be sorted, and in most cases the output dataset will not be sorted.

    The total number of records resulting from the SET statement will be the sum of the records in each input dataset.

    The output dataset contains every variable that appears in any input dataset. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.
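    For the longitudinal scenario described earlier, stacking is often combined with dataset options that keep and rename variables as each file is read. The sketch below is only an illustration: every dataset and variable name in it (year1, year2, income_y1, income_y2, allyears, year) is hypothetical.

    ```sas
    /* Sketch: stack two annual files, keeping and renaming variables on input. */
    /* All dataset and variable names here are hypothetical.                    */
    DATA year1;
      INPUT id income_y1 region $;
      DATALINES;
    1 42000 N
    2 38500 S
    ;
    RUN;

    DATA year2;
      INPUT id income_y2 region $;
      DATALINES;
    1 43500 N
    3 51000 W
    ;
    RUN;

    DATA allyears;
      /* KEEP= is applied before RENAME=, so KEEP= lists the original names */
      SET year1 (KEEP=id income_y1 RENAME=(income_y1=income) IN=in1)
          year2 (KEEP=id income_y2 RENAME=(income_y2=income));
      IF in1 THEN year = 1; /* IN= flags which input supplied the observation */
      ELSE year = 2;
    RUN;
    ```

    The result has four observations (two from each input) and three variables: id, income, and year. Renaming both income variables to a common name keeps them in one column instead of leaving two half-empty ones.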

    Example. Consider the datasets master and test. The master dataset contains social security number and name, and the test dataset contains social security number and a test score. What is a reasonable way to combine these datasets? It would make sense to match score to name by social security number. We will do this later. For now, let us see what happens if we use the SET statement, so that the datasets are simply concatenated.

    ***Programs to create data sets MASTER and TEST;
    DATA MASTER;
    INPUT SS NAME : $9.;
    DATALINES;
    123456789 CODY
    987654321 SMITH
    111223333 GREGORY
    222334444 HAMER
    777665555 CHAMBLISS
    ;

    DATA TEST;
    INPUT SS SCORE;
    DATALINES;
    123456789 100
    987654321 67
    222334444 92
    ;

    DATA combine;
    SET master test;
    RUN;

    Examine the resulting dataset using PROC PRINT. Now switch the order of the input datasets in the SET statement and compare the outputs.
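    One way to carry out this comparison is sketched below. It assumes the MASTER, TEST and combine datasets created above; the name combine2 is hypothetical. The comments show the contents to expect, although the exact layout of the PROC PRINT listing will differ.

    ```sas
    PROC PRINT DATA=combine;
    RUN;

    /* Contents of combine (SET master test):
       SS          NAME        SCORE
       123456789   CODY            .
       987654321   SMITH           .
       111223333   GREGORY         .
       222334444   HAMER           .
       777665555   CHAMBLISS       .
       123456789                 100
       987654321                  67
       222334444                  92
    */

    /* Same inputs in the opposite order; combine2 is a hypothetical name. */
    DATA combine2;
      SET test master;
    RUN;

    PROC PRINT DATA=combine2;
    RUN;

    /* combine2 holds the three TEST records first (NAME blank),
       then the five MASTER records (SCORE missing), and its
       variables appear in the order SS, SCORE, NAME. */
    ```

    In both cases there are eight observations, and no name is ever paired with a score. This is why stacking is the wrong tool when records need to be matched.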

    Interleaving SAS datasets

    When datasets are concatenated as in the above example, records are ordered simply by how they arrive. The records from the first dataset listed in the SET statement come first, in the order they have in that input dataset. Records from the dataset listed last come last, again in their input order. In most cases this results in an unsorted dataset, even if the input datasets are sorted.

    When the input datasets are sorted by some variable and you want the combined dataset to be sorted by that same variable, it is more efficient to interleave the datasets than to stack them and then sort the result.

    The syntax for interleaving datasets is the same as that for stacking, with the addition of a BY statement:

    DATA new;
    SET dataset1 dataset2 ... datasetn;
    BY variable1 ... variablen;

    Some important notes:

    The input datasets must be sorted by the same variables that appear in the BY statement above.

    The output dataset will be sorted according to the BY statement.

    The total number of records resulting from the SET statement will be the sum of the records in each input dataset.

    The output dataset contains every variable that appears in any input dataset. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.

    Example. Interleave the datasets master and test. Since these datasets have only one variable in common, social security number, the interleaving will be by that variable. The datasets are not sorted by this variable, so they must be sorted before the interleaving.

    PROC SORT DATA=master;
    BY ss;
    RUN;

    PROC SORT DATA=test;
    BY ss;
    RUN;

    DATA combine;
    SET master test;
    BY ss;
    RUN;

    Examine the resulting dataset using PROC PRINT.
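    As a check on what interleaving produces, the sketch below prints the combine dataset from the example above and lists the contents to expect in a comment. The exact layout of the PROC PRINT listing will differ.

    ```sas
    PROC PRINT DATA=combine;
    RUN;

    /* Contents of combine after interleaving BY ss:
       SS          NAME        SCORE
       111223333   GREGORY         .
       123456789   CODY            .
       123456789                 100
       222334444   HAMER           .
       222334444                  92
       777665555   CHAMBLISS       .
       987654321   SMITH           .
       987654321                  67
    */
    ```

    There are still eight observations, now in order of ss. Within each ss value the record from master comes first because master is listed first in the SET statement. The name and the score for a person remain on separate records.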

    Using the MERGE statement

    Combining datasets may involve matching on a particular variable. When this is the case, the MERGE statement should be used instead of the SET statement to specify the input datasets.

    The general format of a MERGE statement is:

    DATA new;
    MERGE dataset1 dataset2;
    BY variable1 variable2 ... variablen;
    RUN;

    Important

    If the two input datasets have variables with the same name (other than the BY variables), the values from the second dataset will overwrite the values from the first dataset. Use the RENAME= option to assign new names to those variables in one of the datasets.

    The two input datasets MUST be sorted in order of the BY variables, and the BY variables must have identical names in both datasets. If they do not have the same name, use the RENAME= option.
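    The overwriting problem and its RENAME= remedy can be sketched as follows. The datasets visit1 and visit2 and the variables weight, weight1, weight2 and change are hypothetical names chosen for illustration.

    ```sas
    /* Sketch with hypothetical data: both inputs carry a variable named weight. */
    DATA visit1;
      INPUT id weight;
      DATALINES;
    1 150
    2 182
    ;
    RUN;

    DATA visit2;
      INPUT id weight;
      DATALINES;
    1 148
    2 185
    ;
    RUN;

    /* Without RENAME=, weight from visit2 would overwrite weight from visit1. */
    /* Both inputs are already in order of id, so no PROC SORT is needed here. */
    DATA bothvisits;
      MERGE visit1 (RENAME=(weight=weight1))
            visit2 (RENAME=(weight=weight2));
      BY id;
      change = weight2 - weight1; /* both measurements are now available */
    RUN;
    ```

    Renaming on input keeps both measurements in the merged record, so they can be compared directly.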

    There are a variety of ways to merge two datasets using the MERGE statement:

    1. One-to-one match merge, with a BY value
    2. Non-matches with a BY value
    3. Limiting observations by using the IN= option
    4. One-to-many match merge

    One-to-one match merge with a BY value

    The following example merges a file containing demographic data and a file of health statistics into a single file called pophlth. In this case both datasets have exactly the same zip codes but different variables.

    The file of demographic data has the following observations:

    zip   popsize hhsize
    78201 111111  121212
    78202 222222  343434
    78203 333333  565656

    The health file has the following observations:

    zip   diab access
    78201 .04  .8
    78202 .05  .7
    78203 .06  .6

    The files will be merged by zip code, and the merged file will contain ALL variables from both input files.

    /*Always remember to sort the input datasets*/
    PROC SORT DATA=demog;
    BY zip;
    RUN;

    PROC SORT DATA=health;
    BY zip;
    RUN;

    DATA pophlth;
    MERGE demog  /* demography file */
          health; /* health stat. file */
    BY zip;
    RUN;

    The new dataset pophlth will have the following observations:

    zip   popsize hhsize diab access
    78201 111111  121212 .04  .8
    78202 222222  343434 .05  .7
    78203 333333  565656 .06  .6

    Note: Suppose the variable for zip code is named differently in the two input files. If the health dataset uses the name zcode, for example, you can use the RENAME= option to change it:

    MERGE demog
          health (RENAME = (zcode = zip));

    Non-matches with a BY value

    Suppose in the previous example that the health dataset has a zip code, say 78204, that is not in the dataset demog. The MERGE statement will keep all values for the health variables and set the demographic variables to missing.

    As an extreme case, consider the case where no zip codes match. Consider a second health dataset with the following observations:

    zip   diab access
    78214 .04  .7
    78215 .03  .8
    78216 .02  .9

    The same code used previously will generate a new dataset pophlth with the following observations:

    zip   popsize hhsize diab access
    78201 111111  121212 .    .
    78202 222222  343434 .    .
    78203 333333  565656 .    .
    78214 .       .      .04  .7
    78215 .       .      .03  .8
    78216 .       .      .02  .9

    Limiting observations by using the IN= option

    In the previous examples, all of the input records from both files belonged in the new output dataset. With the IN= option, you can specify which observations you wish to keep.

    Suppose the demog dataset that we used earlier is a subset of a much larger dataset which contains demographic data for all US zip codes. Let's call this dataset Usdemog. We can merge this dataset with the health dataset by zip code, but we may not wish to keep all the observations with zip codes for which there is no health data. Consider the following code:

    DATA pophlth;
    MERGE demog (IN=indem)
          health;
    BY zip;
    IF indem THEN OUTPUT;
    RUN;

    The IN= option creates the temporary variable indem, which is set to true only for observations to which the demog file contributed. The output file will contain an observation for each of the three zip codes in the demog dataset, along with the associated variables from the health file. If there is no matching zip code in the health file, the health variables will be set to missing.

    If you wanted to output records only if there was a demography AND a health record for the same zip code, you would use the following:

    DATA pophlth;
    MERGE demog (IN=indem)
          health (IN=inhlth);
    BY zip;
    IF indem AND inhlth THEN OUTPUT;
    RUN;

    You can take advantage of the IN= option to create several output datasets in one DATA step. Suppose you are merging the same files (Usdemog and health) and you wish to keep a dataset of records that matched as well as datasets of records that did not match. The code can easily be modified to accomplish this:

    DATA phmatch  /*records that matched*/
         unm_hlth /*health records with no match in demog*/
         unm_dem; /*demog records with no match in health*/
    MERGE demog (IN=indem)
          health (IN=inhlth);
    BY zip;
    IF indem THEN DO;
       IF inhlth THEN OUTPUT phmatch;
       ELSE OUTPUT unm_dem;
    END;
    ELSE IF inhlth THEN OUTPUT unm_hlth;
    RUN;
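    To see how the three output datasets fill up, the step can be tried on a small pair of inputs. The sketch below recreates demog from the table shown earlier and uses a hypothetical health file in which 78203 is absent and 78204 has no demographic record. Reading zip as a numeric variable is an assumption.

    ```sas
    /* demog as in the earlier table; zip is assumed to be numeric. */
    DATA demog;
      INPUT zip popsize hhsize;
      DATALINES;
    78201 111111 121212
    78202 222222 343434
    78203 333333 565656
    ;
    RUN;

    /* Hypothetical health file: no record for 78203, extra record for 78204. */
    DATA health;
      INPUT zip diab access;
      DATALINES;
    78201 .04 .8
    78202 .05 .7
    78204 .03 .9
    ;
    RUN;

    /* Both inputs are already in order of zip. */
    DATA phmatch unm_hlth unm_dem;
      MERGE demog  (IN=indem)
            health (IN=inhlth);
      BY zip;
      IF indem THEN DO;
        IF inhlth THEN OUTPUT phmatch; /* zip found in both files */
        ELSE OUTPUT unm_dem;           /* zip found only in demog */
      END;
      ELSE IF inhlth THEN OUTPUT unm_hlth; /* zip found only in health */
    RUN;
    ```

    With these inputs, phmatch receives zip codes 78201 and 78202, unm_dem receives 78203 (health variables missing), and unm_hlth receives 78204 (demographic variables missing).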

    One-to-Many match merge

    Sometimes you want to merge several observations in one dataset with a single observation in another dataset. For example, suppose you have data by zip code and you need to get state information for those zip codes. Since each state has multiple zip codes, you will need a one-to-many merge.

    The difference between this type of merge and the one-to-one merge is not in the code, but in the data. For this reason, it is important to know your datasets before you merge them! If, in fact, your datasets are not in a one-to-many format but in a many-to-many format, your new dataset may not be what you expect it to be!

    Consider the following demographic dataset, which has a state identifier:

    zip   zipsize state
    27703 111111  12
    78202 222222  28
    78203 333333  28

    Consider the following dataset with access to care information, by state:

    state access
    12    0.7
    23    0.8
    28    0.6

    The following code will perform the one-to-many merge:

    DATA new;
    MERGE demog access;
    BY state;
    RUN;

    The new dataset will contain the following observations:

    zip   zipsize state access
    27703 111111  12    0.7
    .     .       23    0.8
    78202 222222  28    0.6
    78203 333333  28    0.6
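    The record for state 23 appears because the access file contributes a state that has no zip code records. If that row is not wanted, the IN= option described earlier can filter it out. The following is a minimal sketch that recreates the two inputs from the tables above; reading all variables as numeric is an assumption.

    ```sas
    /* Inputs recreated from the tables above; all variables assumed numeric. */
    DATA demog;
      INPUT zip zipsize state;
      DATALINES;
    27703 111111 12
    78202 222222 28
    78203 333333 28
    ;
    RUN;

    DATA access;
      INPUT state access;
      DATALINES;
    12 0.7
    23 0.8
    28 0.6
    ;
    RUN;

    /* Both inputs must be in order of the BY variable. */
    PROC SORT DATA=demog;
      BY state;
    RUN;

    PROC SORT DATA=access;
      BY state;
    RUN;

    DATA new;
      MERGE demog (IN=indem) access;
      BY state;
      IF indem; /* keep only states that have at least one zip code record */
    RUN;
    ```

    The result has three observations, one per zip code, and each carries the access value of its state. Zip codes 78202 and 78203 both receive 0.6 from the single state 28 record.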

    Updating Datasets

    You may have a master dataset that needs to be updated periodically. Study data are an example of data that need to be updated with corrections, follow-up data, or other new information.

    The general form of the UPDATE statement is the same as for the MERGE statement, except that it takes exactly two input datasets, a master dataset and a transaction dataset:

    DATA master;
    UPDATE master transactions;
    BY variable1 ... variablen;
    RUN;

    Important notes:

    Input datasets must be sorted according to the BY statement. The output dataset will be sorted.

    The values of the BY variables must be unique in the master dataset (but not necessarily in the transaction/updates dataset).

    Missing values in the transaction dataset DO NOT replace existing values in the master dataset.
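    The last note is the main practical difference between UPDATE and MERGE. The sketch below contrasts the two on the same inputs. All dataset and variable names in it (base, trans, updated, merged, id, height, weight) are hypothetical.

    ```sas
    /* Hypothetical master file. */
    DATA base;
      INPUT id height weight;
      DATALINES;
    1 170 65
    2 182 80
    ;
    RUN;

    /* Hypothetical transaction: new weight for id 1, height left missing. */
    DATA trans;
      INPUT id height weight;
      DATALINES;
    1 . 68
    ;
    RUN;

    DATA updated;
      UPDATE base trans;
      BY id;
    RUN;
    /* updated, id 1: height = 170 (kept), weight = 68 (replaced) */

    DATA merged;
      MERGE base trans;
      BY id;
    RUN;
    /* merged, id 1: height = . (overwritten by missing), weight = 68 */
    ```

    In both results the record for id 2 is unchanged, since no transaction refers to it. UPDATE is therefore the safer choice when transaction records carry only the fields that changed.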

    Example. (B 6.8)

    A hospital maintains a master database with patient information. Each record contains the patient's account number, last name, address, date of birth, sex, insurance code, and the date that patient's information was last updated. Whenever a patient is admitted to the hospital, a transaction record is created, containing new information and status changes. Some of the patients are new (and are not yet in the master database).

    The code below illustrates how the master dataset is updated:

    DATA master;
    INPUT Account LastName $ 8-16 Address $ 17-34
          BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate
          MMDDYY10.;
    DATALINES;
    620135 Smith    234 Aspen St.     12-21-1975 m CBC 02-16-1998
    645722 Miyamoto 65 3rd Ave.       04-03-1936 f MCR 05-30-1999
    645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
    874329 Kazoyan  76-C La Vista     .          . MCD 01-15-2003
    ;
    RUN;

    DATA transactions;
    INPUT Account LastName $ 8-16 Address $ 17-34 BirthDate MMDDYY10.
          Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
    DATALINES;
    620135 .        .                 .          . HLT 06-15-2003
    874329 .        .                 04-24-1954 m .   06-15-2003
    235777 Harman   5656 Land Way     01-18-2000 f MCD 06-15-2003
    ;
    RUN;

    PROC SORT DATA = transactions;
    BY Account;
    RUN;

    * Update patient data with transactions;
    DATA master;
    UPDATE master transactions;
    BY Account;
    RUN;

    PROC PRINT DATA = master;
    FORMAT BirthDate LastUpdate MMDDYY10.;
    TITLE 'Admissions Data';
    RUN;