Ch 2. DATA Step


Chapter 2: Getting Your Data into SAS

Section 1: Methods to Get Your Data Into SAS

Methods for Getting Data In

Normally, our data live outside the SAS environment, and we need to get them in. There is always a way to get the data into SAS, no matter where they reside.

There are four general categories of methods for getting your data into SAS:
- Entering data directly into SAS data sets (Viewtable)
- Creating SAS data sets from raw data files (using DATA steps, IMPORT)
- Converting other software's data files into SAS data sets (IMPORT)
- Reading other software's data files directly (SAS/ACCESS, out of this course's scope)

Of course, the method will depend on where your data is located and what tools are available to you.

Reading Data into SAS

For us, there are four options for reading data into SAS:

- INFILE statement (external raw data) in a DATA step
  - List input
  - Column input
  - Input with informats
- CARDS or DATALINES statements (internal raw data)
- Importing the data set via the SAS interface (not available in SAS Studio)
- Importing the data set via PROC IMPORT (not covered in this course)

Entering Data with the DATA Step

When to use DATA steps?
- To get your data into SAS
- To make the data set SAS compatible (although this is not the case for our course's data sets)
- To extract a subset of the variables or observations for further analysis
- To transform a data set into the format a procedure requires, since different procedures may need the same data in different formats

We will be talking about this method a lot during this chapter and the next couple of chapters.

INFILE Statement

Format:
DATA data-set-name;
INFILE 'path-to-raw-data-file';
INPUT variable-list;

The input style can be:
- List
- Column
- Informats

The path in the INFILE statement must be enclosed in quotation marks:

UNIX: INFILE '/home/mydir/president.dat'; /*for SAS Studio*/
Windows: INFILE 'c:\MyDir\President.dat'; /*for Base SAS*/

SAS log gives very valuable information.

Data Separated with Spaces

This method is very easy, but the ease comes with some limitations:
- Values must be separated by at least one space.
- Values are at most 8 characters long.
- Data must be read all at once, with no skipping over any values.
- Dates should be avoided if possible, because they need special care.

Despite the limitations, it is a very popular method for raw data.

INPUT Name $ Age Height;

This style is also called list input because the variable names are simply listed in the INPUT statement, in the order in which their values appear in the data file.

Example 2.1.1 Data Separated with Spaces

Your hometown has been overrun with toads this year. A local resident, having heard of frog jumping in California, had the idea of organizing a toad jump to cap off the annual town fair. For each contestant you have the toad's name, weight, and the jump distance from three separate attempts. If the toad is disqualified for any jump, then a period is used to indicate missing data. Here is what the data file ToadJump.dat looks like:

Lucky 2.3 1.9 . 3.0
Spot 4.6 2.5 3.1 .5
Tubs 7.1 . . 3.8
Hop 4.5 3.2 1.9 2.6
Noisy 3.8 1.3 1.8 1.5
Winner 5.7 . . .
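A minimal sketch of how the toad data could be read with list input. The data set name toads and the variable names ToadName, Weight, and Jump1 through Jump3 are assumptions chosen for illustration. The data are typed in-stream with DATALINES so that the sketch runs without the external file.

```sas
/* List input: one variable name per value, in file order.      */
/* All names here (toads, ToadName, Jump1...) are hypothetical. */
DATA toads;
   INPUT ToadName $ Weight Jump1 Jump2 Jump3;
   DATALINES;
Lucky 2.3 1.9 . 3.0
Spot 4.6 2.5 3.1 .5
Tubs 7.1 . . 3.8
Hop 4.5 3.2 1.9 2.6
Noisy 3.8 1.3 1.8 1.5
Winner 5.7 . . .
;
RUN;

/* Print the result to check that every value landed in the right variable. */
PROC PRINT DATA = toads;
   TITLE 'Toad Jump Results';
RUN;
```

To read the external file instead, the DATALINES block would be dropped and an INFILE statement with the quoted path to ToadJump.dat placed before the INPUT statement.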

Data Arranged in Columns

This style is also called column input. There are no delimiters. Instead, each variable's values always occupy the same columns in the data file. Values must be character or standard numeric; numeric values cannot have thousands separators or special date formats.

Advantages over list input:
- No need for spaces between values
- Missing values can be left blank
- Character data can have embedded spaces (a very good sign that you should use this input style)
- You can skip unwanted variables

INPUT Name $ 1-10 Age 11-13 Height 14-18;

The local minor league baseball team, the Walla Walla Sweets, is keeping records about concession sales. A ballpark favorite is the sweet onion rings, which are sold at the concession stands and also by vendors in the bleachers. The ballpark owners have a feeling that in games with lots of hits and runs, more onion rings are sold in the bleachers than at the concession stands. They think they should send more vendors out into the bleachers when the game heats up, but they need more evidence to back up their feelings.

For each home game they have the following information: name of opposing team, number of onion ring sales at the concession stands and in the bleachers, the number of hits for each team, and the final score for each team. The following is a sample of the data file named OnionRing.dat. For your reference, a column ruler showing the column numbers has been placed above the data:

----+----1----+----2----+----3----+----4
Columbia Peaches      35  67  1 10  2  1
Plains Peanuts       210      2  5  0  2
Gilroy Garlics        151035 12 11  7  6
Sacramento Tomatoes  124  85 15  4  9  1

Example 2.1.2 Data Arranged in Columns
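A sketch of column input for the onion ring data. The column ranges and the variable names are assumptions for illustration: the team name is taken to occupy columns 1-20, the two sales figures columns 21-24 and 25-28, and the four game statistics three columns each through column 40. The data are given in-stream so the sketch does not depend on the external file.

```sas
/* Column input: each variable is followed by the columns it occupies. */
/* Variable names and column ranges are assumed, not from the source.  */
DATA sales;
   INPUT VisitingTeam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28
         OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;
   DATALINES;
Columbia Peaches      35  67  1 10  2  1
Plains Peanuts       210      2  5  0  2
Gilroy Garlics        151035 12 11  7  6
Sacramento Tomatoes  124  85 15  4  9  1
;
RUN;

PROC PRINT DATA = sales;
   TITLE 'Onion Ring Sales';
RUN;
```

Two features of column input show up here: the team names contain embedded spaces, and the blank bleacher sales for Plains Peanuts should be read as a missing value. The run of digits 151035 is split into 15 and 1035 purely by column position.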

Reading Data into SAS (non-standard formats)

Informats are useful anytime you have non-standard data. There are three general types of informats: character, numeric, and date.

Character     Numeric        Date
$informatw.   informatw.d    informatw.

The $ indicates character informats, w is the total width, and d is the number of decimal places (numeric informats only).

INPUT Name $10. Age 3. Height 5.1 BirthDate MMDDYY10.;

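The period is very important and is often overlooked. It is part of the informat name; without it, SAS may try to interpret the informat as a variable name.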

A list of useful informats

Example 2.1.3

This example illustrates the use of informats for reading data. The following data file, Pumpkin.dat, represents the results from a local pumpkin-carving contest. Each line includes the contestant's name, age, type (carved or decorated), the date the pumpkin was entered, and the scores from each of five judges.

Alicia Grossman  13 c 10-28-2008 7.8 6.5 7.2 8.0 7.9
Matthew Lee       9 D 10-30-2008 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2008 8.9 7.9 8.5 9.0 8.8
Lori Newcombe     6 D 10-30-2008 6.7 5.6 4.9 5.2 6.1
Jose Martinez     7 d 10-31-2008 8.9 9.510.0 9.7 9.0
Brian Williams   11 C 10-29-2008 7.8 8.4 8.5 7.9 8.0
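One way the pumpkin data might be read with informats is sketched below. The variable names and the assumed layout (name in the first 16 columns, age in the next 3, one blank before and after the type, then a 10-column date and five 4-column scores) are illustrative assumptions. The sketch also uses two features not described above: +1, which moves the pointer one column forward, and a parenthesized informat list, which applies 4.1 to all five scores.

```sas
/* Formatted input: each variable is followed by an informat.  */
/* Names and widths are assumptions about the file layout.     */
DATA contest;
   INPUT Name $16. Age 3. +1 Type $1. +1 Date MMDDYY10.
         (Score1 Score2 Score3 Score4 Score5) (4.1);
   DATALINES;
Alicia Grossman  13 c 10-28-2008 7.8 6.5 7.2 8.0 7.9
Matthew Lee       9 D 10-30-2008 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2008 8.9 7.9 8.5 9.0 8.8
Lori Newcombe     6 D 10-30-2008 6.7 5.6 4.9 5.2 6.1
Jose Martinez     7 d 10-31-2008 8.9 9.510.0 9.7 9.0
Brian Williams   11 C 10-29-2008 7.8 8.4 8.5 7.9 8.0
;
RUN;

/* Without a FORMAT statement, Date prints as a SAS date number. */
PROC PRINT DATA = contest;
   TITLE 'Pumpkin Carving Contest';
RUN;
```

Because the informats read fixed widths, the scores 9.5 and 10.0 on the Jose Martinez line are separated correctly even though no space lies between them.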

Column Pointer @

When you mix input styles, there is one possible complication. When SAS reads a line of raw data it uses a pointer to mark its place, but each style of input uses the pointer differently. With list style input, SAS automatically scans to the next non-blank field and starts reading. With column style input, SAS starts reading in the exact column you specify. But with formatted input, SAS just starts reading wherever the pointer happens to be.

@n moves the pointer to the nth column.
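A small sketch of @n in a mixed INPUT statement. The data set, variable names, and data values are all hypothetical and made up for illustration. The city name is read with column input, then @14 and @24 place the pointer explicitly before each informat, so the formatted reads start in a known column.

```sas
/* Mixed input styles with explicit column pointers.        */
/* Everything here (cities, the two rows) is hypothetical.  */
DATA cities;
   INPUT City $ 1-12
         @14 Population COMMA9.    /* jump to column 14, read 9 columns   */
         @24 Founded MMDDYY10.;    /* jump to column 24, read 10 columns  */
   FORMAT Founded DATE9.;          /* display the date readably           */
   DATALINES;
Springfield  1,250,000 03-15-1821
Rivertown       48,300 11-02-1902
;
RUN;

PROC PRINT DATA = cities;
RUN;
```

Without the @14 and @24 pointers, each informat would start reading wherever the previous read stopped, which is the complication described above.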

Example 2.1.4 Challenge

Try to read the NatPark.dat file into SAS by writing your own code. (OPTIONAL but good exercise)

Try to use all the methods we have learned so far. Use list input, column input, informats, and the column pointer in the same INPUT statement. There is no single correct solution to the problem. Use your imagination. The output should look something like this:

Solution will be provided in the Husky CT forums.

INFILE Statement Options
- FIRSTOBS =
- OBS =
- MISSOVER
- TRUNCOVER
- DLM =
- DSD
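A brief sketch of several of these options working together, using made-up in-stream data (the data set, names, and scores are hypothetical). FIRSTOBS= names the first line to read and OBS= the last; MISSOVER assigns missing values when a line runs out of data instead of continuing onto the next line; DLM= sets the delimiter; DSD treats two consecutive delimiters as a missing value. TRUNCOVER, not shown, plays a similar role to MISSOVER for column or formatted input when some lines are shorter than expected.

```sas
/* INFILE options applied to in-stream data (hypothetical values). */
DATA scores;
   INFILE DATALINES DLM = ',' DSD FIRSTOBS = 2 OBS = 4 MISSOVER;
   INPUT Name $ Test1 Test2 Test3;
   DATALINES;
Name,Test1,Test2,Test3
Ann,90,,85
Bob,78,88
Cara,95,91,99
Dan,60,70,80
;
RUN;

PROC PRINT DATA = scores;
RUN;
```

Under these settings the header line should be skipped, reading should stop after the Cara line, Ann's second test should be missing because of the doubled comma, and Bob's third test should be missing rather than borrowed from the following line.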

Examples 2.1.5 through 2.1.8

Reading Data into SAS (cont'd.)

CARDS or DATALINES statements

Format:
DATA data-set-name;
INPUT variable-list;
DATALINES; (or CARDS;)
data lines
;

Example:

DATA animals;
   INPUT Zooname $ Tigers Lions Monkeys;
   CARDS;
San_Diego 7 4 23
New_York 11 4 37
Orlando 2 8 41
;

OR

DATA animals;
   INPUT Zooname $ Tigers Lions Monkeys;
   DATALINES;
San_Diego 7 4 23
New_York 11 4 37
Orlando 2 8 41
;

Section 2: Working with SAS Data Sets

Temporary vs. Permanent

Temporary data set
- Available only for the current session
- Immediately erased when the session is finished

Permanent data set
- Remains when the job or session is finished

If you use a data set more than once, it is more efficient to save it as a permanent SAS data set than to create a new temporary SAS data set every time you want to use the data.

SAS Data Set Names

Two-level approach: WORK.MYSALES
- WORK is the libref (library reference)
- MYSALES is the member name

Both levels follow standard SAS naming conventions.

Is my data set permanent or temporary?

There is no explicit option that makes a data set temporary or permanent.

This information is implied by where you put your data set. If it is in the WORK library, it is temporary; otherwise, it is permanent.

This also means that if you don't specify a libref with your data set name, the data set will be temporary, because it goes to the WORK library, which is the default.

Example

Data Statement         Libref   Member name   Type
DATA ironman;          WORK     ironman       temporary
DATA WORK.ironman;     WORK     ironman       temporary
DATA Bikes.ironman;    Bikes    ironman       permanent

Temporary:

DATA distance;
   Miles = 26.22;
   Kilometers = 1.61 * Miles;
RUN;
PROC PRINT DATA = distance;
RUN;

Permanent:

DATA Bikes.distance;
   Miles = 26.22;
   Kilometers = 1.61 * Miles;
RUN;
PROC PRINT DATA = Bikes.distance;
RUN;

LIBNAME Statement

A libref is a nickname that corresponds to the location of a SAS data library.

Use the LIBNAME statement to create a libref.

Format: LIBNAME libref 'path-to-your-data-library';
Example: LIBNAME mySASlib 'c:\SAS\myrawdata';

You can also define a libref using the New Library window.

Example 2.2.1

This program sets up a libref named PLANTS pointing to the BaseData directory. Then it reads the raw data from a file called Mag.dat, creating a permanent SAS data set named MAGNOLIA which is stored in the PLANTS library.

M. grandiflora Southern Magnolia 80 15 E white
M. campbellii                    80 20 D rose
M. liliiflora  Lily Magnolia     12  4 D purple
M. soulangiana Saucer Magnolia   25  3 D pink
M. stellata    Star Magnolia     10  3 D white

Example 2.2.2

LIBNAME example "/folders/myfolders/basedata/";
PROC PRINT DATA = example.magnolia;
   TITLE 'Magnolias';
RUN;

Note that the librefs in this example and the previous one are different, but they refer to the same location, so this code works.
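The program described in Example 2.2.1 is not reproduced in these slides; a sketch of what it might look like follows. The directory is the one used in Example 2.2.2, while the location of Mag.dat, the variable names, and the column ranges (scientific name in columns 1-14, common name in columns 16-32) are assumptions for illustration.

```sas
/* Create the libref; the directory must already exist. */
LIBNAME plants "/folders/myfolders/basedata/";

/* A two-level name with a non-WORK libref makes the data set permanent. */
DATA plants.magnolia;
   INFILE "/folders/myfolders/basedata/Mag.dat";   /* assumed file location */
   /* Column input for the two names (they contain spaces),  */
   /* then list input for the remaining four values.         */
   INPUT ScientificName $ 1-14 CommonName $ 16-32
         MaximumHeight AgeBloom Type $ Color $;
RUN;
```

Column input is used for the names because they contain embedded spaces and one common name is blank; the remaining values are separated by spaces, so list input is enough for them.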

Example 2.2.3 and Example 2.2.4

You can also write to and read from any SAS data set file by direct referencing.
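A minimal sketch of direct referencing, in which the quoted path of the data set takes the place of a libref and member name. The directory is assumed to be the one from Example 2.2.2 and must already exist; the distance data set is borrowed from the earlier temporary-versus-permanent example.

```sas
/* Write a permanent data set by quoting its path; no LIBNAME needed. */
DATA "/folders/myfolders/basedata/distance";
   Miles = 26.22;
   Kilometers = 1.61 * Miles;
RUN;

/* Read it back the same way. */
PROC PRINT DATA = "/folders/myfolders/basedata/distance";
RUN;
```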

Listing the contents of a data set with PROC CONTENTS

Format:
PROC CONTENTS DATA = data-set-name;
RUN;

PROC CONTENTS is a simple procedure that outputs the metadata of a data set, that is, a description of its contents rather than the data values themselves.
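As a quick illustration, the sketch below builds the small temporary distance data set used earlier in this chapter and then lists its metadata. Any existing data set name could be used in its place.

```sas
/* Create a small temporary data set to inspect. */
DATA distance;
   Miles = 26.22;
   Kilometers = 1.61 * Miles;
RUN;

/* List the metadata of the data set rather than its values. */
PROC CONTENTS DATA = distance;
RUN;
```

The report typically includes the number of observations and variables along with each variable's name, type, and length.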

PROC Contents Output

Further Reading

Optional: Read The Little SAS Book, Sections 2.12-2.18, for more advanced data parsing methods.