Chapter 20 Creating Multiple Observations from a Single Record Objectives Create multiple...

Chapter 20Creating Multiple Observations from a Single

Record

Objectives• Create multiple observations from a single record

containing repeating blocks of data• Create multiple observations from a single record

containing one ID field followed by the same number of repeating fields.

• Create multiple observations from a single records containing one ID field followed by a varying number of repeating fields

Three Situations Involving Multiple Observations from a Single record

Situation 1: Repeating blocks of data that represent separate observations:

Each record in the following data consists of three individuals’ test scores:TOM 80 JOHN 60 TERRY 90KEN 85 STAN 78 SCOTT 86

There are six individuals in this data. Three observations are in one record.

Situation 2: an ID field followed by an equal number of repeated fields that

represent separate observations

The following data set consists of individuals’ top three hobbies. An ID followed by three hobbies:

01 WLAKING RUNNING SWINNING02 GOLFING TENNIS BASEBALL03 SWINNING TENNIS BASKETBALL

Situation 3: An ID field followed by a varying number of repeating fields that

represent separate observation

The following is the transactional data from a grocery store, which records individual’s shopping list:

001 PORK CHEESE BEER VEGETABLE 002 Cake BEER WINE DOGFOOD COOKIE003 BEER CHEESE WINE004 CANDY CHOCHLATE

How does SAS read multiple observations from one single record?

SAS introduces two line-holding specifiers:• The trailing sign, @ : This sign holds the input

record for the execution of the next INPUT statement.

• The double trailing sign, @@: This sign holds the input record for the execution of the next record statement, even across iteration of the data step.

• NOTE: Both trailing @ and double trailing @@ must be the last item in the INPUT statement.

6

Trailing @ Versus Double Trailing @

Option Effect

Trailing @ INPUT var-1... @;

Holds raw data record until1) an INPUT statement with no

trailing @2) the bottom of the DATA step.

Double trailing @@ INPUT var-1 ... @@;

Holds raw data records in the input buffer until SAS reads past the end of the line.

7

Situation 1: Reading Repeating Blocks of Data

A raw data file contains each employee’s identification number and this year’s contribution to his or her retirement plan. Each record contains information for multiple employees.

E00973 1400 E09872 2003 E73150 2400E45671 4500 E34805 1980

8

Desired Output

The output SAS data set should have one observation per employee.

EmpID Contrib

E00973 1400E09872 2003E73150 2400E45671 4500E34805 1980

9

Processing: What Is Required?E00973 1400 E09872 2003 E73150 2400

Read for Obs. 1

ProcessOther

Statements

Output

Read for Obs. 2

ProcessOther

Statements

Output

Read for Obs. 3

ProcessOther

Statements

Output

...

10

Use the Double Trailing @@to read repeating blocks of data

• The double trailing @@ holds the raw data record across iterations of the DATA step until the line pointer moves past the end of the line.

INPUT var1 var2 var3 … @@;INPUT var1 var2 var3 … @@;

11

The Double Trailing @@

data work.retire; length EmpID $ 6; infile 'raw-data-file'; input EmpID $ Contrib @@;run;

Hold until endof record.

...

12

NOTE: 2 records were read from the infile 'retire.dat'. The minimum record length was 35. The maximum record length was 36.NOTE: SAS went to a new line when INPUT statement reached past the end of a line.NOTE: The data set WORK.RETIRE has 5 observations and 2 variables.

Creating Multiple Observations Per Record

Partial Log

The "SAS went to a new line" message is expected because the @@ option indicates that SAS should read until the end of each record.

13

EmpID Contrib

E00973 1400E09872 2003E73150 2400E45671 4500E34805 1980

Creating Multiple Observations Per Record

proc print data=retire noobs;run;

PROC PRINT Output

Exercise 1

Open c20_1 program. Run each program and observe the results. Make sure you learn how to use trailing @@.

Situation 2: ID followed by the same # of repeating fields

The following data consists of employee’s quarterly sales. Each record consists of Employ ID followed each of the four quarter sales. A05 2,304.53 3,012.55 2,567.12 3,835.55A06 3,249.44 4,132.75 3,655.21 4,886.32A07 1,965.34 2,540.67 2,103.65 3,023.54A08 5,341.55 5,021.40 6,011.61 7,561.48A09 3,455.91 3,122.43 3,664.13 4,721.84A10 4,678.43 5,217.90 4,633.85 5,725.35

The goal is to create a data set with the following variables:ID Quarter SalesA05 1 2304.53A05 2 3012.55A05 3 2567.12A05 4 3835.55 - - - - - - - - - - - - - -

Use Trailing @ to read records with ID followed by same # of repeating fields

Data EXPENSE1;input ID $ @;INPUT EXPENSE : COMMA. @; OUTPUT;INPUT EXPENSE : COMMA. @; OUTPUT;INPUT EXPENSE : COMMA. @; OUTPUT;INPUT EXPENSE : COMMA. @; OUTPUT;datalines;A05 2,304.53 3,012.55 2,567.12 3,835.55A06 3,249.44 4,132.75 3,655.21 4,886.32A07 1,965.34 2,540.67 2,103.65 3,023.54A08 5,341.55 5,021.40 6,011.61 7,561.48A09 3,455.91 3,122.43 3,664.13 4,721.84A10 4,678.43 5,217.90 4,633.85 5,725.35;

PROC PRINT;title 'Use @ - read ID, then, read mulptiple expenses for EXPENSE variable';RUN;

Using DO Loop and Trailing @ together

data expense2;input id $ @;do quarter = 1 to 4;input expense: comma. @ ; output;end;datalines;A05 2,304.53 3,012.55 2,567.12 3,835.55A06 3,249.44 4,132.75 3,655.21 4,886.32A07 1,965.34 2,540.67 2,103.65 3,023.54A08 5,341.55 5,021.40 6,011.61 7,561.48A09 3,455.91 3,122.43 3,664.13 4,721.84A10 4,678.43 5,217.90 4,633.85 5,725.35;proc print; title 'Use @ - read ID, then, DO loop to read muptiple expenses for EXPENSE

variable';run;

Exercise 2

Open the program c20_2Run each program and learn how to use one single trailing @ .

Situation 3: Reading data with ID followed by varying # of repeating fields

The following data consists of employee’s quarterly sales. Each record consists of Employ ID followed each of the four quarter sales. A05 2,304.53 3,012.55 2,567.12 3,835.55A06 3,249.44 4,132.75 3,655.21A07 1,965.34 2,540.67 2,103.65 3,023.54A08 5,341.55 5,021.40A09 3,455.91 3,122.43 3,664.13 4,721.84A10 4,678.43 5,217.90 4,633.85 5,725.35

The goal is to create a data set with the following variables:ID Quarter SalesA05 1 2304.53A05 2 3012.55A05 3 2567.12A05 4 3835.55 - - - - - - - - - - - - - -

NOTE: There are some quarterly sales missing at the 3rd and 4th quarter. As a result, the # of repeating fields varies.

Reading records with ID followed by varying # of repeating fields

To read this type of data, we need to use trailing @ to hold the input statement to process the data step, then read next variables until @, then, process these variables as a record until the end of the data step, then, read the next variables and so on.

If the # of repeating fields are not same, then, one can consider there are missing data at the end of the record, and apply the MISSOVER option in the INFILE statement to handle the varying # of fields.

NOTE: If the record length is not fixed, then, PAD option will be needed to fix the record length problem.

data expense3;infile datalines missover pad;input id $ EXPENSE : COMMA. @;QUARTER = 0;do until (EXPENSE eq . ); QUARTER+1; output; input expense: comma. @ ; end;datalines;A05 2,304.53 3,012.55 2,567.12 3,835.55 A06 3,249.44 4,132.75 3,655.21 A07 1,965.34 2,540.67 2,103.65 3,023.54 A08 5,341.55 5,021.40 A09 3,455.91 3,122.43 3,664.13 4,721.84 A10 4,678.43 5,217.90 4,633.85 5,725.35 ;proc print; title 'Use @ - read ID, then, use DO WHILE to read multiple

expenses for EXPENSE variable';run;

NOTE: This program uses MISSOVER option to handle the missing at the end of a record and

use PAD option to take care of the variable record length problem.

To read the record with ID followed by varying # of fields with missing in the middle or beginning

• It is possible that there are missing data in the beginning, middle or end of a record.

• To handle this situation, in addition to using MISSOVER and PAD, one may use DSD

• It is possible that the data is recorded in free format, and list input will be needed. In this situation, it usually also requires to specify a delimiters using DLM = ‘delimiters’ option

data expense4;infile datalines dlm = '/' missover dsd;input id $ EXPENSE : COMMA. @;quarter=0;do until (EXPENSE eq . ); QUARTER+1; output; input expense: comma. @ ; end;datalines;A05/2,304.53 /3,012.55 /2,567.12/ 3,835.55A06 // 4,132.75/ 3,655.21/ 4,886.32A07/ 1,965.34 /2,540.67/ 2,103.65/ 3,023.54A08/ 5,341.55/ 5,021.40/ 6,011.61 /7,561.48A09 /3,455.91/ 3,122.43 /3,664.13/ A10/ 4,678.43/ 5,217.90/ 4,633.85/ 5,725.35;proc print; title 'Use @ - read ID, the, use DO UNTIL to read multiple expenses

for EXPENSE variable';run;

This program uses MISSOVER to handle the missing at the end for varying # of repeating fields, Use DSD to handle the missing in the middle.

Exercise 3

Open program c20_3.Run each program and learn how to read multiple observations from single record using Trailing @ for situation when # of repeats are different.

Chapter 20 Creating Multiple Observations from a Single Record Objectives Create multiple...

Documents

Transcript of Chapter 20 Creating Multiple Observations from a Single Record Objectives Create multiple...