Chapter 20: Creating Multiple Observations from a Single Record
Objectives
• Create multiple observations from a single record containing repeating blocks of data.
• Create multiple observations from a single record containing one ID field followed by the same number of repeating fields.
• Create multiple observations from a single record containing one ID field followed by a varying number of repeating fields.
Three Situations Involving Multiple Observations from a Single Record
Situation 1: Repeating blocks of data that represent separate observations
Each record in the following data contains the test scores of three individuals:
TOM 80 JOHN 60 TERRY 90
KEN 85 STAN 78 SCOTT 86
The data cover six individuals, with three observations stored in each record.
Situation 2: An ID field followed by an equal number of repeating fields that represent separate observations
The following data set records each individual's top three hobbies. Each record is an ID followed by three hobbies:
01 WALKING RUNNING SWIMMING
02 GOLFING TENNIS BASEBALL
03 SWIMMING TENNIS BASKETBALL
Situation 3: An ID field followed by a varying number of repeating fields that represent separate observations
The following transactional data from a grocery store record each individual's shopping list:
001 PORK CHEESE BEER VEGETABLE
002 Cake BEER WINE DOGFOOD COOKIE
003 BEER CHEESE WINE
004 CANDY CHOCOLATE
How does SAS read multiple observations from a single record?
SAS provides two line-hold specifiers:
• The trailing at sign, @: holds the input record for the execution of the next INPUT statement within the same iteration of the DATA step.
• The double trailing at sign, @@: holds the input record for the execution of the next INPUT statement, even across iterations of the DATA step.
• NOTE: Both the trailing @ and the double trailing @@ must be the last item in the INPUT statement.
Trailing @ Versus Double Trailing @@

Option: Trailing @
Syntax: INPUT var-1 ... @;
Effect: Holds the raw data record until
  1) an INPUT statement with no trailing @ executes, or
  2) the bottom of the DATA step is reached.

Option: Double trailing @@
Syntax: INPUT var-1 ... @@;
Effect: Holds the raw data record in the input buffer until SAS reads past the end of the line.
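To make the contrast concrete, the following sketch reads the test-score records from Situation 1 twice, once with each specifier. The data set names SCORES_SINGLE and SCORES_DOUBLE and the variable names NAME and SCORE are chosen here for illustration only.

```sas
/* Single trailing @: the record is released at the bottom of the  */
/* DATA step, so each iteration starts on a new record. Only the   */
/* first name/score pair of each line is read (TOM and KEN).       */
data scores_single;
    input name $ score @;
datalines;
TOM 80 JOHN 60 TERRY 90
KEN 85 STAN 78 SCOTT 86
;
run;

/* Double trailing @@: the record is held across iterations, so    */
/* every name/score pair becomes its own observation (six in all). */
data scores_double;
    input name $ score @@;
datalines;
TOM 80 JOHN 60 TERRY 90
KEN 85 STAN 78 SCOTT 86
;
run;
```

Comparing the two logs shows the effect: the first step should report two observations and the second six.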
Situation 1: Reading Repeating Blocks of Data
A raw data file contains each employee’s identification number and this year’s contribution to his or her retirement plan. Each record contains information for multiple employees.
E00973 1400 E09872 2003 E73150 2400
E45671 4500 E34805 1980
Desired Output
The output SAS data set should have one observation per employee.
EmpID   Contrib
E00973  1400
E09872  2003
E73150  2400
E45671  4500
E34805  1980
Processing: What Is Required?
E00973 1400 E09872 2003 E73150 2400
• Read for Obs. 1, process other statements, output.
• Read for Obs. 2, process other statements, output.
• Read for Obs. 3, process other statements, output.
...
Use the Double Trailing @@ to Read Repeating Blocks of Data
• The double trailing @@ holds the raw data record across iterations of the DATA step until the line pointer moves past the end of the line.
INPUT var1 var2 var3 … @@;
The Double Trailing @@
data work.retire;
   length EmpID $ 6;
   infile 'raw-data-file';
   input EmpID $ Contrib @@;   /* @@: hold until end of record */
run;
...
Creating Multiple Observations Per Record
Partial Log
NOTE: 2 records were read from the infile 'retire.dat'.
      The minimum record length was 35.
      The maximum record length was 36.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.RETIRE has 5 observations and 2 variables.
The "SAS went to a new line" message is expected because the @@ option tells SAS to keep reading until the end of each record.
Creating Multiple Observations Per Record
proc print data=retire noobs;
run;
PROC PRINT Output
EmpID   Contrib
E00973  1400
E09872  2003
E73150  2400
E45671  4500
E34805  1980
Exercise 1
Open the c20_1 program. Run each program and observe the results. Make sure you learn how to use the double trailing @@.
Situation 2: ID followed by the same # of repeating fields
The following data consist of employees' quarterly sales. Each record contains an employee ID followed by the sales for each of the four quarters.
A05 2,304.53 3,012.55 2,567.12 3,835.55
A06 3,249.44 4,132.75 3,655.21 4,886.32
A07 1,965.34 2,540.67 2,103.65 3,023.54
A08 5,341.55 5,021.40 6,011.61 7,561.48
A09 3,455.91 3,122.43 3,664.13 4,721.84
A10 4,678.43 5,217.90 4,633.85 5,725.35
The goal is to create a data set with the following variables:
ID   Quarter  Sales
A05  1        2304.53
A05  2        3012.55
A05  3        2567.12
A05  4        3835.55
- - - - - - - - - - - - - -
Use Trailing @ to read records with ID followed by same # of repeating fields
data expense1;
   input id $ @;
   input expense : comma. @; output;
   input expense : comma. @; output;
   input expense : comma. @; output;
   input expense : comma. @; output;
datalines;
A05 2,304.53 3,012.55 2,567.12 3,835.55
A06 3,249.44 4,132.75 3,655.21 4,886.32
A07 1,965.34 2,540.67 2,103.65 3,023.54
A08 5,341.55 5,021.40 6,011.61 7,561.48
A09 3,455.91 3,122.43 3,664.13 4,721.84
A10 4,678.43 5,217.90 4,633.85 5,725.35
;
proc print;
   title 'Use @ - read ID, then read multiple expenses for EXPENSE variable';
run;
Using DO Loop and Trailing @ together
data expense2;
   input id $ @;
   do quarter = 1 to 4;
      input expense : comma. @;
      output;
   end;
datalines;
A05 2,304.53 3,012.55 2,567.12 3,835.55
A06 3,249.44 4,132.75 3,655.21 4,886.32
A07 1,965.34 2,540.67 2,103.65 3,023.54
A08 5,341.55 5,021.40 6,011.61 7,561.48
A09 3,455.91 3,122.43 3,664.13 4,721.84
A10 4,678.43 5,217.90 4,633.85 5,725.35
;
proc print;
   title 'Use @ - read ID, then DO loop to read multiple expenses for EXPENSE variable';
run;
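The same DO loop pattern also works for character fields. As a sketch, the hobbies data from the Situation 2 overview could be read as follows. The data set name HOBBIES and the variable names RANK and HOBBY are illustrative, and the $10. informat is an assumption chosen so that the longest value, BASKETBALL, is not truncated to the default length of 8.

```sas
data hobbies;
    input id $ @;                 /* read the ID and hold the record   */
    do rank = 1 to 3;             /* three hobbies follow every ID     */
        input hobby : $10. @;     /* read the next hobby, keep holding */
        output;                   /* one observation per hobby         */
    end;
datalines;
01 WALKING RUNNING SWIMMING
02 GOLFING TENNIS BASEBALL
03 SWIMMING TENNIS BASKETBALL
;
run;

proc print data=hobbies noobs;
run;
```

Each of the three records contributes three observations, so the resulting data set should contain nine.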
Exercise 2
Open the c20_2 program. Run each program and learn how to use a single trailing @.
Situation 3: Reading data with ID followed by varying # of repeating fields
The following data consist of employees' quarterly sales. Each record contains an employee ID followed by up to four quarterly sales figures.
A05 2,304.53 3,012.55 2,567.12 3,835.55
A06 3,249.44 4,132.75 3,655.21
A07 1,965.34 2,540.67 2,103.65 3,023.54
A08 5,341.55 5,021.40
A09 3,455.91 3,122.43 3,664.13 4,721.84
A10 4,678.43 5,217.90 4,633.85 5,725.35
The goal is to create a data set with the following variables:
ID   Quarter  Sales
A05  1        2304.53
A05  2        3012.55
A05  3        2567.12
A05  4        3835.55
- - - - - - - - - - - - - -
NOTE: Some sales figures for the 3rd and 4th quarters are missing. As a result, the number of repeating fields varies from record to record.
Reading records with ID followed by varying # of repeating fields
To read this type of data, use a trailing @ to hold the record after the first INPUT statement reads the ID and the first value. Process that value and output an observation, then read the next value with another INPUT statement that also ends in a trailing @. Repeat until no values remain in the record.
When the number of repeating fields varies, a short record can be treated as having missing values at its end. The MISSOVER option in the INFILE statement handles this case: instead of moving to the next record to look for more values, SAS sets the remaining variables to missing.
NOTE: If the record length is not fixed, add the PAD option so that short records are padded with blanks to the logical record length.
data expense3;
   infile datalines missover pad;
   input id $ expense : comma. @;    /* read the ID and the first expense  */
   quarter = 0;
   do until (expense eq .);          /* stop at the first missing value    */
      quarter + 1;
      output;
      input expense : comma. @;      /* read the next expense, hold record */
   end;
datalines;
A05 2,304.53 3,012.55 2,567.12 3,835.55
A06 3,249.44 4,132.75 3,655.21
A07 1,965.34 2,540.67 2,103.65 3,023.54
A08 5,341.55 5,021.40
A09 3,455.91 3,122.43 3,664.13 4,721.84
A10 4,678.43 5,217.90 4,633.85 5,725.35
;
proc print;
   title 'Use @ - read ID, then use DO UNTIL to read multiple expenses for EXPENSE variable';
run;
NOTE: This program uses the MISSOVER option to handle missing values at the end of a record and the PAD option to handle variable record lengths.
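The same technique carries over to character fields, where the loop ends on a blank value rather than a numeric missing value. The following is a minimal sketch for the grocery data from the Situation 3 overview. The data set name GROCERY, the variable name ITEM, and the lengths in the LENGTH statement are assumptions made for illustration.

```sas
data grocery;
    infile datalines missover;
    length id $ 3 item $ 12;      /* avoid truncating items at 8 characters  */
    input id $ item $ @;          /* read the ID and the first item          */
    do while (item ne ' ');       /* a blank item means the list has ended   */
        output;                   /* one observation per item                */
        input item $ @;           /* MISSOVER returns blank past end of line */
    end;
datalines;
001 PORK CHEESE BEER VEGETABLE
002 Cake BEER WINE DOGFOOD COOKIE
003 BEER CHEESE WINE
004 CANDY CHOCOLATE
;
run;

proc print data=grocery noobs;
run;
```

With these four records the step should write 14 observations, one for each item on each shopping list.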
Reading records with an ID followed by a varying # of fields, with missing values in the middle or at the beginning
• Missing data may occur at the beginning, in the middle, or at the end of a record.
• To handle this situation, in addition to MISSOVER and PAD, one may use the DSD option, which treats two consecutive delimiters as a missing value.
• The data may also be recorded in free format, which requires list input. In that case you usually also need to specify the delimiter with the DLM = 'delimiters' option.
data expense4;
   infile datalines dlm = '/' missover dsd;
   input id $ expense : comma. @;
   quarter = 0;
   do until (expense eq .);
      quarter + 1;
      output;
      input expense : comma. @;
   end;
datalines;
A05/2,304.53 /3,012.55 /2,567.12/ 3,835.55
A06 // 4,132.75/ 3,655.21/ 4,886.32
A07/ 1,965.34 /2,540.67/ 2,103.65/ 3,023.54
A08/ 5,341.55/ 5,021.40/ 6,011.61 /7,561.48
A09 /3,455.91/ 3,122.43 /3,664.13/
A10/ 4,678.43/ 5,217.90/ 4,633.85/ 5,725.35
;
proc print;
   title 'Use @ - read ID, then use DO UNTIL to read multiple expenses for EXPENSE variable';
run;
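This program uses MISSOVER to handle missing values at the end of records with a varying number of repeating fields, and DSD so that consecutive delimiters are read as a missing value. Note that the DO UNTIL condition is tested after each value is read inside the loop. A missing first value (as in record A06) is still output, but a missing value in a later position ends processing of that record.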
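When a value can be missing between two recorded quarters, one possible alternative is a fixed-count DO loop, which keeps the quarter numbers aligned regardless of where the gap falls. The sketch below assumes every record holds at most four quarters. The data set name EXPENSE5 and the two data lines are hypothetical and are not taken from the course programs.

```sas
data expense5;
    infile datalines dlm = '/' dsd missover;
    input id $ @;                       /* read the ID and hold the record   */
    do quarter = 1 to 4;                /* always step through four quarters */
        input expense : comma. @;       /* empty field -> missing value      */
        output;                         /* missing quarters are kept         */
    end;
datalines;
A05/2,304.53//2,567.12/3,835.55
A09/3,455.91/3,122.43/3,664.13/
;
run;

proc print data=expense5 noobs;
run;
```

Each record yields four observations here, with EXPENSE missing for quarter 2 of A05 and quarter 4 of A09. To drop the missing quarters instead, the OUTPUT statement could be replaced with `if expense ne . then output;`.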
Exercise 3
Open the c20_3 program. Run each program and learn how to read multiple observations from a single record using a trailing @ when the number of repeating fields varies.