
Haas MFE SAS Workshop

Lecture 2: The Data Management

Alex Vedrashko

For sample code and these slides, see Peng Liu’s page: http://faculty.haas.berkeley.edu/peliu/computing


Creating datasets (recap from L1)

The ultimate goal: save data to disk as a permanent SAS dataset.

Read from a SAS library file (e.g. downloaded from CRSP):

    libname mylib 'r:\temp\';
    data d; set mylib.crspsample; run;

Read using INFILE and INPUT from an external file. Example from Lecture 1:

    DATA LOAN1;
    INFILE 'R:\bulk\SAS\MFE\loan.txt' DELIMITER=',';
    INPUT ID Origination mmddyy10. Term Rate Balance Appraisal LTV FICO_orig City $ State $2. ;

Use the SAS menu “File – Import Data”. Very flexible.

Read using INPUT and DATALINES (CARDS). The double trailing @@ holds the input line across iterations of the DATA step, so that each value on the line becomes a separate observation:

    DATA portf; INPUT portfolioreturn @@;
    datalines;
    15.9 -2.1 0.3
    ;


Viewing Datasets

Browse the saved SAS dataset with extension sas7bdat. Double click on the file in windows explorer or use

Browse the saved SAS dataset in SAS Explorer. Click on Libraries icon, then on your library name.

Datasets are automatically assigned to the WORK library if no libname is given, e.g. dataset d1 is actually WORK.d1 in

    data d1; set mylib.d0; … run;

The WORK library is temporary: all datasets in it disappear when you close SAS.
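As a minimal sketch of the difference between temporary and permanent datasets (mylib and d0 are the names used above; the folder path is only an example), a one-level name goes to WORK, while a two-level name is saved in the library's folder:

```sas
libname mylib 'r:\temp\';        /* example path, adjust to your machine */

/* One-level name: stored as WORK.d1 and deleted when SAS closes */
data d1;
  set mylib.d0;
run;

/* Two-level name: saved as d1.sas7bdat in the mylib folder */
data mylib.d1;
  set mylib.d0;
run;
```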


Viewing Datasets: PUT statement

1. Proc Print; (see Lecture 1)

2. PUT statement in the DATA step. Syntax: PUT variable names; Writes to the LOG window. Useful for debugging and simple output to text files.

    data d; set loan1;
    put origination term city;
    run;

Show variable names:

    put origination= term= city;

Output to a file, rather than the LOG window:

    filename f "r:\mysasoutput.txt";
    data d; set loan1;
    file f; put origination term city;
    run;

Preset SAS variables: _N_ (the number of the current DATA step iteration, usually the observation number), _ALL_ (stands for “all variables”):

    put _n_ city state;


FORMAT statement

FORMAT is an instruction that tells SAS how to write variable values

Specifying Format is usually necessary to make Date variables readable.

Permanently associate a format with a variable in a given dataset:

    data d; set loan1;
    format Origination mmddyy8.; put Origination rate;
    proc print; run;

The PUT statement and Proc Print automatically use this format.

Temporarily: you can also specify the format in Proc Print:

    proc print;
    format Origination mmddyy8.;
    run;

SAS stores dates as the number of days from Jan. 1, 1960.
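To see the stored number behind a date, here is a small sketch (the variable name d is arbitrary) that writes the same date constant to the LOG with and without a format:

```sas
data _null_;
  d = '01Jan2002'd;      /* SAS date constant */
  put d;                 /* raw value: 15341 days since Jan. 1, 1960 */
  put d mmddyy10.;       /* 01/01/2002 */
  put d date9.;          /* 01JAN2002 */
run;
```

The value itself never changes; only the format decides how it is displayed.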


Data Type Conversion: the issue

SAS has only two data types: Numeric and Character. Date/time values are stored as numbers and made readable with formats.

When you accidentally mix variable types, SAS tries to fix your program by converting them.

Log file: “NOTE: Numeric values have been converted to character values ...” Do not ignore this!

For example, 110 can be stored as a number or as the character string '110'. When you apply a numeric function to a character variable, or vice versa, SAS first tries to convert the value to the appropriate data type and then evaluates the function.

How to fix? A practical way is to use the INPUT and PUT functions. They are close cousins of the INPUT and PUT statements, but different!


Data Type Conversion: Solution

Character to Numeric

    new=INPUT (old, informat);

The informat must be of the type you are converting to: numeric or SAS date.

    Rate_num=input(rate_chr,5.);

To verify, apply a numeric function:

    Intgr_r=floor(Rate_num);

Numeric to Character

    new=PUT (old, format);

The format must be of the type you are converting from: numeric.

    Rate_chr=put(rate,5.2);

To verify, apply a character function:

    Digit1=substr(Rate_chr,1,1);
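The two conversions can be tried together in one short DATA step. This is only a sketch: the dataset name conv and the literal values are made up for illustration. Because numeric formats right-align their output, LEFT is applied so that the first character is a digit rather than a leading blank.

```sas
data conv;
  rate_chr = '6.25';                    /* character value */
  rate_num = input(rate_chr, 5.);       /* character -> numeric */
  intgr_r  = floor(rate_num);           /* numeric function on a numeric value */

  rate     = 6.25;                      /* numeric value */
  rate_txt = left(put(rate, 5.2));      /* numeric -> character, left-aligned */
  digit1   = substr(rate_txt, 1, 1);    /* character function on a character value */

  put rate_num= intgr_r= rate_txt= digit1=;   /* inspect the results in the LOG */
run;
```

With explicit INPUT and PUT calls, the LOG should contain no automatic conversion notes for these variables.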


Titles and Footnotes

SAS allows up to ten lines of text at the top (titles) and bottom (footnotes) of each page of output, specified with title and footnote statements. The form of these statements is

    title<n> text; or footnote<n> text;

where n, if specified, can range from 1 to 10, and text is normally enclosed in single or double quotes (unquoted text is also accepted).

    Title 'Mortgage dataset';
    Proc print; run;

If text is omitted, the title or footnote is deleted; otherwise it remains in effect until it is redefined. Thus, to have no titles, use:

    title;

By default SAS includes the date and page number at the top of each piece of output. These can be suppressed with the nodate and nonumber system options.


System Options for Output Control

Syntax: options opt;

Useful options to manage how SAS output (the OUTPUT window) looks:

Date/nodate (shows the current date)

Number/nonumber (shows the page number)

Center/nocenter (centers output; useful for proc means, etc.)

formdlim = '-' (defines the delimiter between pages; results in more readable output of econometric procs)

See all available options (in the LOG window): proc options; run;
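A short sketch that combines these options with a title and a footnote, assuming the loan1 dataset from Lecture 1 (the footnote text is made up):

```sas
options nodate nonumber nocenter formdlim='-';

title 'Mortgage dataset';
footnote 'First 10 loans only';        /* illustrative footnote text */

proc print data=loan1 (obs=10);
run;

title;       /* delete the title so it does not carry over */
footnote;    /* delete the footnote as well */
```

Titles, footnotes and system options stay in effect for the rest of the session, which is why the sketch clears the title and footnote at the end.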


IF-THEN-ELSE

The DATA step is where all variable assignment takes place. Sometimes you will want to make an assignment conditional by using an IF-THEN-ELSE statement:

    IF condition THEN action;
    ELSE IF condition THEN action;
    …
    ELSE action;

Example:

    data p1; set portf;
    if portfolioreturn>10 then promotion=1;
    else if portfolioreturn<0 then promotion=-1; /*fired*/
    else promotion=0;
    run;

The ELSE statements are optional.

With the above syntax you can only assign a single action with each statement.


IF-THEN-DO-END

Use a DO-END group inside of an IF-THEN statement to perform multiple actions on the given condition:

    IF condition THEN DO; action; action; END;

Example:

    if portfolioreturn>10 then do;
        promotion=1;
        bonus=50000+10000*sqrt(portfolioreturn);
    end;
    else if portfolioreturn<0 then do; promotion=-1; bonus=10000; end;
    else do; promotion=0; bonus=50000; end;

Conditions can be specified with symbols or mnemonics:

    Symbol      Mnemonic
    =           EQ
    ^= , ~=     NE
    >           GT
    >=          GE
    <           LT
    <=          LE
    &           AND
    | , !       OR


Logical Conditions

Other useful conditions can be set by the following:

IF var1 IN (val1, val2, val3 …) THEN …;

    if state in ('OR', 'WA', 'CA') then region='Pacific';

Range check: IF val1 <= var1 <= val2 THEN …;

    if 3.7<=GPA<=4 then letterGPA='A';

The BETWEEN val1 AND val2 operator is available only in WHERE expressions, not in IF statements, e.g.

    where GPA between 3.7 and 4;

Conditions can contain functions, numeric and character variables, constants, and mathematical expressions:

    if rate**2 > 25 then highsqrate=1;


Statements and Options That Control Reading and Writing

    Task                  Statements       Data set options   System options
    Manage variables      DROP             DROP=
                          KEEP             KEEP=
                          RENAME           RENAME=
    Manage observations   WHERE            WHERE=
                          subsetting IF    FIRSTOBS=          FIRSTOBS=
                          DELETE           OBS=               OBS=
                          OUTPUT


Manage variables: KEEP, DROP statements

DROP list-of-variables tells which variables from the input dataset should NOT be included in the output dataset.

data d1; set d; drop i j temp_variable;

KEEP list-of-variables tells which variables from the input dataset should be included in the output dataset (the other variables are dropped).

data d1; set d; keep rate balance;


Manage Observations: Subsetting IF statement. DELETE statement

A special case of the IF-THEN statement is an IF statement without a ‘then’ action, i.e. IF condition;

    data d1; set d; if Origination>'01Jan2002'd; rte=rate/100; run;

If the condition is true, then SAS continues with the DATA step statements for this observation. Otherwise no further statements are processed for that observation, and the observation is not added to the data set.

To delete certain observations (the opposite of the subsetting IF statement) use: IF condition THEN DELETE;

data d1; set d; if Origination>'01Jan2002'd then delete;


Subsetting IF: dealing with missing observations

If you want to keep only non-missing observations:

    if portfolioreturn^=.;

leaves only observations where portfolioreturn is not missing. An equivalent statement: if not missing(portfolioreturn);

Caution: the shorthand if portfolioreturn; is not equivalent. SAS treats both missing and zero as false, so it also drops observations with a return of exactly 0.

Note that a missing value in SAS is considered to be smaller than all numeric or character values.

Thus, if portfolioreturn<0; includes observations with missing returns!

To avoid “firing” traders with missing return records, add:

    if portfolioreturn=. then promotion=.;
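One way to fold the missing-value check into the promotion example is to test for missing first, so that the later comparisons never see a missing return. A sketch, assuming the portf dataset from the first slide:

```sas
data p1;
  set portf;
  if missing(portfolioreturn) then promotion = .;    /* no record: no decision */
  else if portfolioreturn > 10 then promotion = 1;
  else if portfolioreturn < 0  then promotion = -1;  /* fired */
  else promotion = 0;
run;
```

Because the branches are chained with ELSE, a missing return can no longer satisfy portfolioreturn < 0.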


Manage Observations:WHERE statement

Alternative to the IF statement for subsetting data: WHERE condition;

    data d1; set loan1; where year(Origination)>2002; run;

Differences between IF and WHERE: http://support.sas.com/faq/042/FAQ04278.html

WHERE can be used in both DATA and PROC steps. IF is only for DATA steps.

    proc print data=loan1; where year(Origination)>2002; run;

WHERE can be used with the CONTAINS operator:

    data d1; set loan1; where city contains 'SANTA'; run;

WHERE cannot be used to subset data read with INPUT statements. It only controls data that comes from existing SAS data sets via SET or MERGE. (Wrong use: data d; input a; where a>0;)

WHERE cannot be applied to new variables created in the current DATA step; IF can. (Wrong: data d1; set loan1; where Rate_num>3; Use: if Rate_num>3;)
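The two statements can be combined in one DATA step. In this sketch, rate_dec is a hypothetical new variable; WHERE filters on a variable that already exists in loan1, and the subsetting IF filters on the variable created in the step:

```sas
data d1;
  set loan1;
  where year(Origination) > 2002;   /* existing variable: filtered as the data is read */
  rate_dec = rate / 100;            /* hypothetical new variable */
  if rate_dec > 0.03;               /* new variable: must use the subsetting IF */
run;
```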


Manage Observations:Data Set Options

WHERE = condition

    data d1; set d (where= (2002<=year(Origination)<2005));

KEEP = variable list, DROP = variable list

    data d1; set d (keep= origination rate);
    data d1 (drop=x y); infile f; input x y z; ... run;

RENAME = (oldvar = newvar) tells SAS to rename certain variables.

    data d1; set LOAN1 (rename= (origination=issued));


Dataset options (contd.)

Start reading from observation # n. Syntax: FIRSTOBS = n

Stop reading at observation # n. Syntax: OBS = n

    Data d1; set d (firstobs=5 obs=20); … run;

In procedures:

    proc means data=d (firstobs=5 obs=20); run;

Here Proc Means analyzes only observations 5 through 20 of the data set d.


Concatenating Two Data Sets

Concatenating the data sets appends the observations from one data set to another data set.

The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2.

Data set COMBINED contains the results of the concatenation.

Note that the data sets are processed in the order in which they are listed in the SET statement.
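A minimal sketch of the concatenation described above, using the data set names DATA1, DATA2 and COMBINED from the text:

```sas
data combined;
  set data1 data2;   /* all observations of data1, followed by all of data2 */
run;
```

Variables that exist in only one of the input data sets are set to missing for observations that come from the other.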


Interleaving Two Data Sets

The datasets must be sorted by the values of the variables listed in the BY statement.

Similar to Concatenating, but preserves the sorting order.
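Interleaving is a concatenation with a BY statement added. The sketch below assumes two data sets named data1 and data2 that share a variable called year (both names are illustrative):

```sas
proc sort data=data1; by year; run;
proc sort data=data2; by year; run;

data combined;
  set data1 data2;   /* same SET statement as for concatenation */
  by year;           /* BY makes the result come out ordered by year */
run;
```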


One-to-One Reading and One-to-One Merging

(Use this method with caution.)

One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set.

The first observation in one data set is combined with the first in the other, and so on.

The DATA step stops after it has read the last observation from the smallest data set.

One-to-one merging is similar to a one-to-one reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets.
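The two methods differ only in the statements used. A sketch with illustrative data set names data1 and data2:

```sas
/* One-to-one reading: stops at the end of the smallest data set */
data combined_read;
  set data1;
  set data2;
run;

/* One-to-one merging: reads all observations from all data sets */
data combined_merge;
  merge data1 data2;   /* no BY statement, so rows are paired by position */
run;
```

Since rows are paired purely by position, both methods give meaningless results if the data sets are not in the same order, which is the reason for the caution above.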


Match-Merging (most common data set manipulation)

Match-merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables.


Updating

• Input data sets must be sorted by the values of the variables listed in the BY statement. (In this example, MASTER and TRANSACTION are both sorted by Year.)

• UPDATE replaces an existing file with a new file.

• UPDATE does not replace nonmissing values in a master data set with missing values from a transaction data set.
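A minimal sketch of the update, using the MASTER and TRANSACTION data sets and the Year variable mentioned above. The output name master_new is an assumption, chosen so that the original master is left untouched:

```sas
proc sort data=master;      by year; run;
proc sort data=transaction; by year; run;

data master_new;                 /* hypothetical output name */
  update master transaction;     /* master first, transaction second */
  by year;
run;
```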


Merging datasets

1. Sort the datasets according to the variable list in BY.

2. Use the MERGE statement inside a DATA step.

    proc sort data=d1; by var_list; run;
    proc sort data=d2; by var_list; run;
    DATA newdata; MERGE d1 d2 …; BY var_list; run;

The input data sets specified in MERGE will not be modified.

Values of any common variables not specified in the BY statement are likely to be mixed up in the new data set, because they overwrite one another. To prevent this, use the RENAME data set option.

Example. Dataset d1 has variable ret containing market returns, and d2 has variable ret containing individual stock returns. Merge the datasets by tradedate. We rename ret in d1 to mktret:

    data newd; merge d1 (rename= (ret=mktret)) d2; by tradedate; run;


Merging datasets. IN= Data Set Option

The IN= option allows the user to omit observations that are not common to all data sets.

It creates a temporary variable for tracking whether that data set contributed to the current observation. Syntax: IN = index_var_name

    data d; merge d1 (in=indicator1) d2; by tradedate;
    if indicator1;

The indicator is 1 if the data set contributed and 0 otherwise.

Use the IF statement on the index variables. In the above example, only observations found in d1 will be included in d.

The indicator variable will not be written to the new data set. To include it in d, assign its value to a standard variable, e.g. ind1=indicator1;


Example of IN= option. Merging by ID variable.

    Dataset d1 (in=a)     Dataset d2 (in=b)
    ID   V1               ID   V2
                          1    343
    2    421              2    85
    3    129
    4    122              4    763
                          5    229
    6    534              6    554
    7    343
    8    324              8    895


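If a; (i.e. observation must be in dataset d1)

    Dataset d1 (in=a)     Dataset d2 (in=b)
    ID   V1               ID   V2
                          1    343
    2    421              2    85
    3    129                   .
    4    122              4    763
                          5    229
    6    534              6    554
    7    343                   .
    8    324              8    895

Result: IDs 2, 3, 4, 6, 7 and 8 are kept. V2 is missing (.) for IDs 3 and 7, which are not in d2. IDs 1 and 5 are dropped.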


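If b; (i.e. observation must be in dataset d2)

    Dataset d1 (in=a)     Dataset d2 (in=b)
    ID   V1               ID   V2
         .                1    343
    2    421              2    85
    3    129
    4    122              4    763
         .                5    229
    6    534              6    554
    7    343
    8    324              8    895

Result: IDs 1, 2, 4, 5, 6 and 8 are kept. V1 is missing (.) for IDs 1 and 5, which are not in d1. IDs 3 and 7 are dropped.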
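The example can be reproduced end to end. The sketch below enters the two data sets with DATALINES and adds a third case not shown on the slides, if a and b; which keeps only the IDs present in both data sets (the output name both is arbitrary):

```sas
data d1;
  input ID V1;
  datalines;
2 421
3 129
4 122
6 534
7 343
8 324
;

data d2;
  input ID V2;
  datalines;
1 343
2 85
4 763
5 229
6 554
8 895
;

data both;
  merge d1 (in=a) d2 (in=b);
  by ID;              /* both data sets are already sorted by ID */
  if a and b;         /* keeps IDs 2, 4, 6 and 8 */
run;

proc print data=both; run;
```

Replacing the condition with if a; or if b; gives the two cases shown in the tables above.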


Preview of Lecture 3:

Procedures for dataset manipulation:

PROC APPEND adds the observations from one SAS data set to the end of another SAS data set.

PROC SQL reads observations from up to 32 SAS data sets and joins them into single observations; manipulates observations in a SAS data set in place; easily produces a Cartesian product.


OUTPUT command

The OUTPUT statement is used in the DATA step to write the current values of all variables to a data set. There is an IMPLICIT output statement at the end of each DATA step iteration (unless an output statement appears somewhere in the DATA step).

The following pieces of code are equivalent:

    data d; input r1-r9; cards; ...

and

    data d; input r1-r9; output; cards; ...

The OUTPUT statement is commonly used to create several SAS data sets in a single DATA step. Specify the dataset name after OUTPUT.

Example: Split the mortgage data into separate datasets for each state.

    data ca wa; set loan1;
    if state='WA' then OUTPUT wa;
    if state='CA' then OUTPUT ca;
    proc print data=wa; proc print data=ca; run;

Once an OUTPUT statement is specified, the implied OUTPUT at the end of the DATA step no longer exists and all observation writing must be specified by the user.
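Because the implied OUTPUT disappears once an explicit one is used, observations that match no condition are written nowhere. A sketch that extends the state split with a catch-all data set (the name other is an assumption):

```sas
data ca wa other;
  set loan1;
  if state = 'CA' then output ca;
  else if state = 'WA' then output wa;
  else output other;      /* without this line, loans from other states are lost */
run;
```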


DO loop

Example: The input data is a line of four quarters of earnings for 100 firms. Read the data, indexing each observation by quarter.

    data earnings;
    input ticker $ @;
    do quarter=1 to 4;
        input earn @;
        output;
    end;
    datalines;
    ibm 10.2 15 12 8
    msft 25.1 27 29.4 35
    ;
    run;

Other examples:

    do state='CA','OR'; ... end;
    do weekdays=1,3,5; ... end;

Output:

    Obs   ticker   quarter   earn
    1     ibm      1         10.2
    2     ibm      2         15.0
    3     ibm      3         12.0
    4     ibm      4          8.0
    5     msft     1         25.1
    6     msft     2         27.0
    7     msft     3         29.4
    8     msft     4         35.0


Variable Arrays

Arrays are used mainly to group variables.

Useful for performing the same calculations on a group of variables or searching through a set of variables.

For example, your balance sheet data variables d_1 … d_150 are in millions, and you need to convert them to hundreds of millions.

Arrays are defined using the ARRAY statement in a DATA step. Syntax: ARRAY name (n) variable_list;

    ARRAY all_vars var1-var10;

n is the number of elements in the array and is optional.

The variable list is also optional, but either n or the variable list must be specified.

The variable list can contain variables that have not yet been created (an option for initializing variable values).

A $ should precede a variable list of character variables.


Arrays (cont.)

In the calculation section of a DATA step the array can be referenced by name(i), where i is the position of the element you wish to refer to.

Since parentheses are also used in functions, it is not a good idea to give your array the same name as a SAS function.

Example.

    DATA d1; input var1-var10;
    ARRAY all_vars var1-var10;
    DO i = 1 to 10;
        all_vars(i) = i/100;
    END;
    RUN;
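The balance sheet example from the previous slide could be sketched as follows. The input data set bs, the output data set bs_scaled and the array name bal are assumptions; only the variable names d_1 … d_150 come from the text.

```sas
data bs_scaled;
  set bs;                        /* hypothetical data set holding d_1-d_150 */
  array bal{150} d_1-d_150;      /* group the 150 variables under one name */
  do i = 1 to 150;
    bal{i} = bal{i} / 100;       /* millions -> hundreds of millions */
  end;
  drop i;                        /* the loop counter is not needed in the output */
run;
```

One loop replaces 150 separate assignment statements.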


Controlling the Built-in Data Loop:RETAIN Statement

The built-in loop stores the data for a given observation for the current run of the DATA step.

When the loop reaches the end of the DATA step and returns to the top to read the next observation, the values of variables created by INPUT or assignment statements are reset to missing (variables read with SET or MERGE are retained automatically).

To force the built-in loop to keep values from previous observations use the RETAIN statement: RETAIN variable-list;

The variables specified in the RETAIN statement keep their values until they are reset by an INPUT or assignment statement.

Example: Calculate the highest mortgage balance to date.

    proc sort data=loan1; by origination; run;
    data d1; set loan1; retain maxbal; maxbal=max(maxbal, balance); run;


Controlling the Built-in Data Loop:SUM statement

A special case is the sum statement: a plus sign in a statement that does not have an equal sign, e.g. cumsum + newvar;

The sum statement implicitly retains the previous value of cumsum and adds newvar to it.

Example. Calculate the growth of the total appraised value of houses in the dataset to date.

    proc sort data=loan1; by origination; run;
    data d1; set loan1; totalvalue + appraisal; run;

This is equivalent to

    retain totalvalue 0; /*initialize to 0*/
    totalvalue=sum(totalvalue, appraisal);


LAG Function

In general, SAS is not very convenient for directly accessing particular observations, e.g. those for particular dates.

If you want to do serious time series analysis you should use the procedures in the SAS/ETS package.

The lag function is used to reference previous values of a variable:

    newvar = LAG (variable); or newvar = LAGn (variable);

where n refers to the number of observations to go back.

Example (quarterly earnings): lag2_earn = LAG2(earn);

If we are in observation 100, for example, this statement will assign the value of earn from observation 98 to the variable lag2_earn in observation 100. Similarly, the value of lag2_earn in observation 98 will be the value of earn in observation 96.


LAG Function (contd.)

The order of observations is determined by the current sort.

BY does not work with lags. So you need to do manual checks to prevent nonsensical lags when dealing with panel data. (This is the issue with the earnings example.)

Lags are tricky to use because of the built-in loop. Sometimes the lag value is not available (missing).

The lag queue is not initialized until the lag function is called.

Similarly, the lag queue is not updated until the lag function is called.

Hints:

Use separate data steps to create lags and levels.

Do not use the LAG function in a loop.
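One common manual check for panel data is to call LAG on every observation and then blank out the value at the start of each firm. A sketch using the earnings data set from the DO loop example (the output name earnings_lag is an assumption):

```sas
proc sort data=earnings; by ticker quarter; run;

data earnings_lag;
  set earnings;
  by ticker;
  lag_earn = lag(earn);                /* call LAG unconditionally, on every observation */
  if first.ticker then lag_earn = .;   /* do not carry the last quarter of the previous firm */
run;
```

LAG is deliberately kept outside any IF or DO block so that its queue is updated on every observation; only the result is overwritten.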


Lecture 2 References

SAS onlinedoc > “BASE SAS”, “SAS Language Reference: Dictionary” > “Data step options”

Manuals in pdf: http://www.math.wpi.edu/saspdf/common/mainpdf.htm

“Base SAS” section

SAS User Group International “Beginning tutorials” http://www.lexjansen.com/cgi-bin/sugi.php?x=sbt&s=sugi_s

Merging datasets: http://support.sas.com/techsup/technote/ts644.html