Computing Tutorial - asc.ohio-state.edu
Computing Tutorial
STAT 6510
Winter 2012
1. SAS Overview
After you open SAS, you will note that there are lots and lots of windows for the program:
Explorer: Keeps track of your libraries and files.
Results: Keeps track of your output.
Output: Displays output.
Log: Lets you know what SAS has done.
Editor: A word processor that you use to write programs.
Before we start working with data, we need to get a dataset into SAS. There are many ways of doing this. For this
class, I will discuss two different ways:
Using the Table Editor to enter data by hand
Importing data from a text file (including .txt or .csv)
Once we have the data in hand, we will want to work with it. Although SAS has begun to support point-and-click
type approaches to data analysis (using menus), it is better practice to store everything you do to the dataset in
a program. That way, when you make a mistake (and you will), it is simple to change one small part of what you
have programmed, and re-run the program. If you rely on point-and-click, you will have to point-and-click all
over again. SAS does allow you to save the program commands that are executed when you point-and-click,
which is a good way to learn program commands to include in your overall program. Programs primarily consist
of three types of commands:
SAS Variable definitions (which we will use to specify folder locations)
Data steps are used to read data into SAS, manipulate data, and format data.
Procedure steps (abbreviated proc) are used to perform calculations and produce plots using the data.
Here is an example of a program. It assumes that SAS already knows about the dataset "Work.Diamond". This
program keeps only the size of the diamonds (carat), saves the result in a new data set called "Work.Large",
and sorts it by size. Finally, the program creates a histogram of the diamond sizes. We will
look more carefully at each statement as we go through the basics of SAS programming.
/*Data step that loads the dataset "Work.Diamond" into a new file named
"Work.Large", and keeps only the carat variable;*/
DATA Work.Large;
set Work.Diamond;
keep carat;
RUN;
*Proc step that sorts the data according to carat;
PROC sort data=Work.Large;
by carat;
RUN;
*Proc step that makes a histogram using the data;
PROC univariate data=work.large;
histogram carat;
RUN;
2. Getting Data Into SAS
a. SAS Libraries
SAS stores all its datasets in folders called “Libraries”. By default, SAS will use the “Work” library, which
is a temporary library. When you quit SAS, everything in the “Work” library is erased. You can create
permanent libraries, which are simply pointers from SAS to folders located on your computer.
For example, I can create a new library called Stat651 that points to a folder on my desktop.
i. Create a folder on your computer for the SAS Library. I created a folder called 651Computing.
ii. Navigate to your current libraries in SAS. To do this, in SAS, double-click on the “Libraries” icon in
the “Explorer” window.
It will look something like this:
iii. Add your permanent library. Choose File → New, then enter the name for your library (here I used
Stat651), which can be at most 8 characters long, and the path to the folder on your computer
(probably by “browsing” to it). Click OK. (Note that the name in SAS does not need to match the folder
name!)
iv. Check that this worked. Now the explorer lists a new library.
You need to tell SAS where your library is every time you start a new SAS Session!! An easier way to
do this is to declare the library location in the first line of your SAS program:
LIBNAME Stat651 'C:\Documents and Settings\651Computing';
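As a quick check that the LIBNAME statement worked, you can ask SAS to list what is stored in the library. This is only a sketch: it assumes the Stat651 library points at the folder used above, and the NODS option simply shortens the listing to dataset names.

```sas
LIBNAME Stat651 'C:\Documents and Settings\651Computing';

/*List the datasets stored in the Stat651 library (names only)*/
PROC CONTENTS data=Stat651._ALL_ NODS;
RUN;
```

After running this, look in the Log window for a note saying that the library was successfully assigned.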
PRACTICE: Create a Library to store your data in.
b. Enter Data By Hand Using the Table Editor
i. Choose Tools → Table Editor
ii. Click on the heading letters to re-name the variables
iii. Click on the table cells to enter data
iv. Save by choosing File → Save
v. Navigate to the Library where you want to save your data, and enter the name you’d like to give
your data. Here I’m saving the data set as “Diamond” in the “Work” library.
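A small table like this can also be typed directly into a program instead of the Table Editor. The sketch below shows one way to do it. The dataset name Work.DiamondDemo and the three rows of values are made up for illustration and are not from the course data.

```sas
/*Enter a few observations by hand (illustrative values only)*/
DATA Work.DiamondDemo;
    input carat total_price;   /*two numeric variables, separated by spaces*/
    datalines;
0.50 1500
1.20 7200
2.10 21000
;
RUN;
```

The advantage over the Table Editor is that the data entry is stored in your program, so it can be re-run or corrected later.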
c. Importing Data From a Text File
i. Choose File → Import Data
ii. Choose the file type (in this case .txt)
iii. Browse for the file location on your computer
iv. Choose the LIBRARY and File Name that SAS will use to identify the data set. Since we have not
defined any libraries, we will save our data in WORK for now. I can give the data any name I like.
v. If you want to, you can have SAS write the commands to import this data again without having
to go through the point-and-click importing. If so, you need to give SAS a file name in which to
save these commands. Otherwise, just click “Finish”.
vi. You can use the Explorer to check that your data is saved where you think it is. In this case, I can
choose Libraries → Work → Diamond. Double-clicking on the dataset will open the data in the
Data Viewer.
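The commands that the import wizard saves look roughly like the PROC IMPORT step below. This is a sketch under assumptions: the file path is hypothetical, and the file is assumed to be tab-delimited with variable names in the first row. For a comma-separated file you would use dbms=csv instead.

```sas
/*Import a tab-delimited text file into Work.Diamond*/
PROC IMPORT datafile='C:\Documents and Settings\651Computing\DiamondsSRS.txt' /*hypothetical path*/
    out=Work.Diamond   /*library and name for the new dataset*/
    dbms=tab           /*assumes the file is tab-delimited*/
    replace;           /*overwrite Work.Diamond if it already exists*/
    getnames=yes;      /*assumes the first row holds the variable names*/
RUN;
```

Keeping this step at the top of your program means you do not need to point-and-click through the import again in a new session.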
PRACTICE: Download the DiamondsSRS.txt dataset from the course website. Load it into SAS.
3. SAS Programming Basics
As I mentioned earlier, it is best practice to write out a program, rather than “point and click”.
Some things to keep in mind:
Always end each statement with a semicolon;
You can write comments to yourself by using:
/*this is a comment*/
Or you can comment an entire line by using:
*this is a comment;
SAS does not care about capitalization in commands, dataset names, or variable names
Use the command:
RUN;
to tell SAS that you are finished giving it commands, and it should go ahead and do what you
told it to.
SAS has a built-in help menu. Use it as best you can.
Missing numeric values are indicated by a period ‘.’
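To see how a missing value behaves, here is a minimal sketch. The dataset Work.MissingDemo and its three values are made up for illustration. The second observation is entered as a period, so SAS stores it as missing.

```sas
/*Small dataset with one missing value (illustrative values only)*/
DATA Work.MissingDemo;
    input carat;
    datalines;
0.5
.
1.2
;
RUN;

/*Count the non-missing and missing values and compute the mean*/
PROC MEANS data=Work.MissingDemo n nmiss mean;
    var carat;
RUN;
```

The mean is computed from the two non-missing values only; the missing value is counted separately under NMISS.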
4. DATA Steps
Data steps are used to create new datasets out of existing datasets. If you use the same name as an
existing data set, SAS will overwrite your data.
DO NOT RUN:
/*These commands erase all the data stored in the Diamonds dataset in the
Work library, or rather, replaces the current data with an empty data set;*/
Data Work.Diamond;
Run;
Data names consist of two parts, separated by a period. The part before the period specifies the library.
The part after the period specifies the specific name of the data set. Thus, Work.Diamond refers to
the dataset “Diamond” in the library “Work”.
Since we will be building new datasets by changing existing data sets (for example to create a data set
that only contains the carat variable), we need to tell SAS where to get the old datasets. This is what the
SET command is for.
/*Data step that saves the dataset "Work.Diamond" into a new file named
"STAT651.Diamond";*/
DATA STAT651.Diamond; /*name of the new dataset*/
set Work.Diamond; /*name of the current dataset*/
RUN;
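After copying a dataset, it is worth looking at a few rows to confirm the copy is what you expect. A minimal sketch, assuming the STAT651 library has been assigned and the data step above has been run:

```sas
/*Print the first 5 observations of the copied dataset*/
PROC PRINT data=STAT651.Diamond (obs=5);
RUN;
```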
Of course, we sometimes like to change the dataset. We might like to create a new variable, or keep
only a subset of variables.
/*Data step that loads the dataset "Work.Diamond" into a new file named
"Work.Money", changes the dollars to cents, and drops the dollar
variable;*/
DATA Work.Money; /*name of the new dataset*/
set Work.Diamond; /*name of the current dataset*/
cents = total_price*100; /*create a new variable out of an older one*/
keep carat cents; /*keep only the specified variables*/
RUN;
In this class, it might be useful to only keep a certain kind of observation.
/*Keep only those diamonds with at least 2 carats*/
DATA Work.Large; /*name of the new dataset*/
set Work.Diamond; /*name of the current dataset*/
where carat>= 2; /*keep only the specified observations*/
RUN;
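A WHERE statement can only refer to variables that already exist in the dataset being read. If you want to keep observations based on a variable you create in the same data step, one option is a subsetting IF statement, sketched below. The dataset name Work.CentsOver and the $6,000 cutoff are chosen only for illustration.

```sas
/*Keep only diamonds whose price in cents is at least 600000 ($6,000)*/
DATA Work.CentsOver;           /*hypothetical name for the new dataset*/
    set Work.Diamond;
    cents = total_price*100;   /*new variable created in this step*/
    if cents >= 600000;        /*subsetting IF: keep only these observations*/
RUN;
```

An IF statement with no THEN keeps the observation when the condition is true and drops it otherwise.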
In addition to creating variables that are a simple function of other variables, we might like to categorize
the observations according to one or more variables. One way to do this is with if…then…else
statements:
DATA Work.MoMoney; /*name of the new dataset*/
set Work.Diamond; /*name of the current dataset*/
if carat < 2 then size = 0; /*categorize by carat*/
else if carat < 4 then size = 1;
else size = 2;
/*categorize by carat AND price*/
if (carat < 2 & Total_Price > 6000) then expensive = 1;
else expensive = 0;
RUN;
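One thing to watch for with comparisons: SAS treats a missing numeric value as smaller than any number, so a diamond with a missing carat would satisfy carat < 2 and be given size = 0. If your data might contain missing values, a guarded version could look like the sketch below. The dataset name Work.MoMoneyChecked is made up for illustration.

```sas
DATA Work.MoMoneyChecked;                 /*hypothetical name*/
    set Work.Diamond;
    if missing(carat) then size = .;      /*leave size missing when carat is missing*/
    else if carat < 2 then size = 0;
    else if carat < 4 then size = 1;
    else size = 2;
RUN;
```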
PRACTICE: Create a dataset that contains only the values of diamonds less than 1 carat.
PRACTICE: Create a dataset that contains an indicator of a diamond priced more than $10,000.
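5. PROC Steps
Proc steps are all about working with data that has already been created. There are many reasons to do
so, including exploratory data analysis and statistical analysis. But, regardless of its purpose, PROC
steps have the same general anatomy: a PROC statement that names the procedure and the dataset, one or
more statements that tell the procedure what to do, and a RUN statement.
Within a statement, you might also have options, which are indicated by a slash ‘/’. (Options on the PROC
statement itself are listed without a slash.)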
As an example, this code sorts your dataset by the indicated variable:
proc sort data=Work.Diamond;
By Carat;
RUN;
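Note that the sort above replaces Work.Diamond with its sorted version. If you would rather leave the original alone, or sort from largest to smallest, a variation might look like this sketch. The output name Work.DiamondSorted is made up for illustration.

```sas
/*Sort from largest to smallest carat and write the result to a new dataset*/
proc sort data=Work.Diamond out=Work.DiamondSorted;
    By descending Carat;
RUN;
```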
a. Exploratory Data Analysis
This section covers only a few commands. We will look at more as we need them.
i. Univariate numerical summaries
PROC UNIVARIATE data= Work.Diamond;
VAR carat;
RUN;
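PROC UNIVARIATE prints a lot of output. If you only want a compact table of basic summaries, PROC MEANS is one alternative. A sketch, assuming Work.Diamond contains the carat and total_price variables used earlier:

```sas
/*Compact summary: count, mean, standard deviation, minimum, maximum*/
PROC MEANS data=Work.Diamond n mean std min max;
    VAR carat total_price;
RUN;
```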
ii. Histograms
PROC UNIVARIATE data=Work.Diamond NOPRINT;
/*The NOPRINT option suppresses the univariate summary*/
Histogram carat/ midpoints=0 to 6 by 0.25;
Title 'Histogram of Diamond Carats';
RUN;
iii. Tables. The following code would create 3 tables, one for the size of the diamond, one
for whether or not the diamond is expensive and a 2-way table of size vs. expensive:
PROC FREQ data=Work.MoMoney;
TABLES size expensive size*expensive;
RUN;
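By default the 2-way table prints counts along with overall, row, and column percentages in every cell. Options after a slash on the TABLES statement can trim this down. A sketch that keeps only the counts in the 2-way table:

```sas
PROC FREQ data=Work.MoMoney;
    /*Show counts only: suppress overall, row, and column percentages*/
    TABLES size*expensive / nopercent norow nocol;
RUN;
```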
b. Statistical Analysis
i. Estimation of a mean from a survey
PROC SURVEYMEANS data=Work.Diamond N=1000;
Var carat;
RUN;
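In this step, N=1000 tells SAS the size of the population the sample was drawn from, which lets it apply a finite population correction. You can also name the statistics you want on the PROC statement. The sketch below asks for the estimated mean, its standard error, and confidence limits, using the same population size of 1000 as above:

```sas
/*Estimated mean of carat with its standard error and confidence limits*/
PROC SURVEYMEANS data=Work.Diamond N=1000 mean stderr clm;
    Var carat;
RUN;
```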
PRACTICE: Describe the distribution of the value of diamonds less than 1 carat.
PRACTICE: Describe the distribution of the indicator of a diamond priced more than $10,000.
PRACTICE: Use finite population estimation (survey estimation) to estimate the proportion of
diamonds priced more than $10,000. Include a 95% CI.
6. Random Number Generation
a. Add random numbers to an existing data set
Data Work.Diamond;
Set Work.Diamond;
Rnum = uniform(-1); /*The minus one tells SAS to choose a random seed.
If you choose a positive number, you will get the same random numbers each
time. (This can be helpful if you are debugging or want to get the same
sample again.)*/
RUN;
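If you want the same random numbers every time you run the program, replace the -1 with a positive seed. A sketch, where the seed 6510 and the dataset name Work.DiamondSeeded are arbitrary choices for illustration:

```sas
Data Work.DiamondSeeded;        /*hypothetical name; leaves Work.Diamond unchanged*/
    Set Work.Diamond;
    Rnum = uniform(6510);       /*positive seed: same random numbers on every run*/
RUN;
```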
b. Create a dataset that only contains random numbers
Data Work.randnum;
DO ID=1 to 100 by 1;
Rnum = ceil(uniform(-1) * 1000); /* a random integer between 1 and
1000. (ceil=ceiling)*/
Output;
END;
RUN;
/*Check to see if there are any duplicates, by writing any duplicates to a
new dataset*/
proc sort data=work.Randnum;
by Rnum;
RUN;
data work.dupobs;
set work.randnum;
by Rnum;
if ^first.Rnum;
run;
/*Create a dataset with only unique values*/
data work.unique;
set work.randnum;
by Rnum;
if first.Rnum;
run;
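Another way to get a dataset with only unique values is to let PROC SORT drop the duplicates while it sorts. A sketch, where the output name work.unique2 is made up so that it does not overwrite the dataset created above:

```sas
/*Sort by Rnum and keep only the first observation for each value of Rnum*/
proc sort data=work.randnum out=work.unique2 nodupkey;
    by Rnum;
RUN;
```

The Log window reports how many observations with duplicate key values were deleted.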
PRACTICE: Create a dataset that is a random sample of 10 diamonds.
7. SURVEYSELECT
SAS has a procedure that will help you select random samples from populations. The simplest way to select is to
take a simple random sample. The code below takes a simple random sample of size 10 from the dataset
Work.Diamond and saves it as Work.DiamondSample.
proc surveyselect data=work.diamond
method=srs n=10 out = work.diamondsample
seed = 32348340 stats;
run;
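Instead of a fixed sample size, you can ask for a fraction of the population. The sketch below selects a simple random sample of 10% of the diamonds. The output name work.diamondtenth is made up for illustration, and the seed is the same one used above.

```sas
/*Simple random sample of 10% of the observations*/
proc surveyselect data=work.diamond
    method=srs samprate=0.10 out=work.diamondtenth
    seed=32348340;
run;
```

As with the random number functions, fixing the seed means you get the same sample each time you run the step.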
PRACTICE: Create a dataset with unique IDs for all 24 students enrolled in this class. Select a simple
random sample of 5 students to be in a group.