STAT 3304/5304
Introduction to Statistical Computing
Introduction to SAS
What is SAS?
• SAS (originally an acronym for Statistical
Analysis System; the name no longer stands
for anything) is a software system designed
to analyze large sets of numeric and
character data.
• Pronounced “sass”, not spelled out as three
letters.
• Developed in the early 1970s at North
Carolina State University.
• In 1976, SAS Institute Inc., a privately
held corporation, was formed. SAS grew in
popularity and capability and was used in
academic groups.
What is SAS?
• SAS can be used without knowing much
about programming, but it is also a very
sophisticated language, and learning it lets
you do much more.
• SAS was first developed to be a
programming language for statisticians and
data analysts.
• Originally intended for management and
analysis of agricultural field experiments.
What is SAS?
• SAS has grown into the world’s largest
privately held software company.
• SAS is now located in Cary, North Carolina.
• It is a worldwide company with business
in Asia Pacific, Latin America, Europe, the
Middle East, and Africa.
• SAS also has a good employee retention rate
of 96%. It is also a family-oriented company
and is friendly to working women.
What is SAS?
• SAS is now one of the most widely used
statistical software packages.
• Continual product line expansion and
diversification of clientele have resulted in
SAS products being used by over 40,000
customer sites in 50 countries.
• There are 3.5 million users of SAS products.
• Part of the reason for the continual growth
is that the SAS Institute works with the end
user to improve its product.
• It offers solutions for data warehousing, data
mining, data visualization, and applications
development.
What is SAS?
• The SAS System is an applications system
that can be used as
– a statistical package
– a database management system
– a high-level programming language
• An applications system is software that
gives you the tools you need to make the
data useful and meaningful.
• In order to be useful, an applications
system should
– give you total control of your data,
– facilitate applications that run in more
than one computing environment, and
– accommodate varying skill levels of
potential users.
What is SAS?
• SAS is able to run on a variety of platforms
and SAS is also portable across computing
environments.
• A computing environment is determined by
the HARDWARE and the host OPERATING
SYSTEM running on it.
• SAS can be used on IBM mainframes,
UNIX-based machines, and personal computers
running Windows.
• “Portability” means that SAS applications:
– Function the same
– Look the same
– Produce the same results
• You can develop SAS applications in one
environment and run them in other
environments without rewriting the programs.
Modes for Running SAS
• SAS can be run in a variety of styles, or
‘modes’, depending on what type of
operating system it is being run on. The
modes most often used include:
– Batch Mode:
∗ user writes whole SAS programs, saves
them into a file, then runs SAS from a
command line prompt.
– Interactive Line Mode:
∗ user enters commands line by line in
response to prompts issued by the SAS
System.
Modes for Running SAS
– Interactive Window Mode (SAS Display
Manager System):
∗ user interacts with SAS through
windows using pull-down menus,
dialog boxes and icons.
∗ this is the mode used on Windows and
Macintosh.
– SAS Enterprise Guide:
∗ SAS Enterprise Guide software runs only
under Windows
∗ It can write SAS code for you through
its extensive menu system.
How does SAS work?
• With any body of data, you must perform
four basic tasks to make it useful and
meaningful.
– ACCESS – First, you access the data
through the SAS system
– MANAGE – Update, rearrange, combine,
edit, or subset data before analyzing
– ANALYZE – Ranges from simple
descriptive statistics to more advanced or
specialized analyses for econometrics and
forecasting, statistical design, computer
performance evaluation, and operations
research
– PRESENT – Presentation capabilities
range from simple lists and tables to
multidimensional plots to elaborate
full-color graphics, both on paper and on
your display.
How does SAS work?
• A SAS program is a sequence of statements
executed in order.
• A statement gives information or
instructions to SAS and must be
appropriately placed in the program.
• SAS is very lenient about the format of its
input – statements can be broken up across
lines, multiple statements can appear on a
single line, and blank spaces and lines can be
added to make the program more readable.
• The most effective strategy for learning SAS
is to concentrate on the details of the data
step, and learn the details of each procedure
as you have a need for them.
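A minimal sketch of this free-format layout, assuming a small made-up data set named DEMO (the name and values are illustrative, not from the course). The two PROC PRINT steps are the same program laid out differently, and SAS should treat them identically:

```sas
/* DEMO and its values are hypothetical, for illustration only */
DATA DEMO;
  INPUT IDNUM SEX AGE;
  DATALINES;
1 1 25
2 2 33
;
RUN;

/* One statement per line, indented for readability */
PROC PRINT DATA=DEMO;
  VAR IDNUM AGE;
RUN;

/* The same step: several statements on one line,
   and the VAR statement broken across two lines */
PROC PRINT DATA=DEMO; VAR IDNUM
     AGE; RUN;
```

Only the program statements are free-format. The data lines after DATALINES are read as data, so in this sketch each line supplies one observation.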
SAS Windows
• There are five basic SAS windows: the
Results and Explorer windows, and three
programming windows (Editor, Log, and
Output).
• There are also many other SAS windows
that you may use for tasks such as
getting help, changing SAS system options,
and customizing your SAS session.
• Results: The Results window is like a
table of contents for your Output window;
the results tree lists each part of your results
in an outline form.
• Explorer: The Explorer window gives you
easy access to your SAS files and libraries.
SAS Windows
• Editor: In the Editor window, you can use
the text editor to type in, edit, and submit
SAS programs, as well as edit other text
files such as raw data files.
• Log: The Log window contains notes about
your SAS session. After you submit a SAS
program, any notes, errors, or warnings
associated with your program, as well as the
program statements themselves, appear in the
Log window.
• Output: If your program generates any
printable results, then they will appear in the
Output window.
SAS Windows
• In Windows operating environments, the
default editor is the Enhanced Editor.
• The Enhanced Editor is syntax sensitive and
color codes your programs, making it easier
to read them and find mistakes.
– Green: Comments
– Dark Blue: Keywords in major SAS commands
– Blue: Keywords that have special meaning as SAS commands
– Yellow Highlight: Data
– Red: Statements that SAS does not understand
• The Enhanced Editor also allows you to
collapse and expand the various steps in your
program.
• For other operating environments, the
default editor is the Program Editor, whose
features vary with the version of SAS and
operating environment.
General Syntax and Rules
• SAS statements may be in upper or lower
case and may begin on any column.
• SAS statements always end with a semicolon
(;).
• SAS statements may also extend across lines,
and more than one SAS statement may
appear on a single line.
• SAS variable names must be 32 characters
or fewer, constructed of letters, digits, and
the underscore character.
• The first character must be an English letter
(A, B, C, . . ., Z) or underscore (_).
Subsequent characters can be letters, numeric
digits (0, 1, . . ., 9), or underscores.
Characters such as dashes and spaces are not
allowed.
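The naming rules can be illustrated with a short sketch. The data set name NAMES and every variable name below are made up for illustration; the invalid names are kept inside comments so the step still runs:

```sas
/* All names here are hypothetical examples */
DATA NAMES;
  AGE       = 25;    /* valid: letters only                      */
  HEIGHT_CM = 180;   /* valid: letters and an underscore         */
  Q1_SCORE  = 4;     /* valid: a digit after the first character */
  _TEMP     = 1;     /* valid: starts with an underscore         */

  /* Invalid names, shown only as comments:
       2NDVISIT        starts with a digit
       BLOOD-PRESSURE  contains a dash
       FIRST NAME      contains a space                          */
RUN;
```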
General Syntax and Rules
• It's a good idea not to start variable names
with an underscore, because special system
variables are named that way.
• Data set names follow rules similar to those
for variable names, but they have a different
name space.
• There are virtually no reserved keywords in
SAS; it's very good at figuring things out by
context.
• SAS is not case sensitive, except inside of
quoted strings.
• Missing values are handled consistently in
SAS; a missing numeric value is represented
by a period (.).
• Each statement in SAS must end in a
semicolon (;).
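A small sketch of case insensitivity and missing values, assuming a made-up data set named SCORES (names and values are illustrative only). The DATA step is typed in lower case and the PROC step in upper case, yet both refer to the same data set and variable:

```sas
/* SCORES and its values are hypothetical */
data scores;
  input idnum score;
  datalines;
1 80
2 .
3 95
;
run;

/* Same data set and variable, typed in a different case */
PROC MEANS DATA=SCORES N NMISS MEAN;
  VAR SCORE;
RUN;
```

PROC MEANS leaves the missing score out of the calculation, so this step should report N = 2, NMISS = 1, and a mean of 87.5.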
General Syntax and Rules
• To make your programs more
understandable, you can insert comments
into your programs.
• Comments are usually used to annotate the
program, making it easier for someone to
read your program and understand what you
have done and why.
• It doesn't matter what you put in your
comments; SAS will not look at it.
• There are two styles of comments you can
use: one starts with an asterisk (*) and ends
with a semicolon (;). The other style starts
with a slash asterisk (/*) and ends with an
asterisk slash (*/).
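Both comment styles in a short sketch (the data set name COMMENTED and its variables are made up for illustration):

```sas
* This is a statement-style comment; it ends at the first semicolon;

/* This is a block-style comment.
   It can span several lines and may contain semicolons; like this one. */

DATA COMMENTED;
  X = 1;      /* a block-style comment can follow a statement */
  * Y = 2;    /* this assignment is commented out, so Y is never created */
RUN;
```

Because a statement-style comment ends at the first semicolon, the block style is usually the safer choice for commenting out text that itself contains semicolons.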
Getting Help
• The bulk of SAS documentation is available
online, at
http://support.sas.com/documentation/onlinedoc/
• A catalog of printed documentation
available from SAS can be found at
http://support.sas.com/publishing/
• Online help: Type help in the SAS display
manager input window.
• Sample Programs, distributed with SAS on
all platforms.
• SAS Institute Home Page:
http://www.sas.com
• SAS Institute Technical Support:
http://support.sas.com/resources/
Getting Help
• Searchable index to SAS-L, the SAS mailing
list:
http://www.listserv.uga.edu/archives/sas-l.html
• Michael Friendly's Guide to SAS Resources
on the Internet:
http://www.math.yorku.ca/SCS/StatResource.html#SAS
• Brian Yandell's Introduction to SAS:
http://www.stat.wisc.edu/~yandell/software/sas/intro.html
Two Parts of a SAS Program
• There are two main components to most
SAS programs
– DATA steps: create SAS data sets; read
in, manipulate, and edit data.
– PROC steps: process SAS data sets
(creating reports, graphs, editing data,
sorting data, etc.) and can also create
data sets.
• A typical program starts with a DATA step
to create a SAS data set and then passes
the data to a PROC step for processing.
• For example: Raw data and/or a pre-existing
SAS data set are read into a SAS DATA
step, turned into a SAS data set, altered
or analyzed by a PROC step and then the
results are displayed in a report.
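A minimal sketch of this flow, assuming a made-up data set named PEOPLE and a derived variable AGE_MONTHS (both hypothetical): a DATA step reads raw data and adds a variable, then a PROC step turns the data set into a report.

```sas
/* DATA step: read raw data and create a SAS data set */
DATA PEOPLE;
  INPUT IDNUM SEX AGE;
  AGE_MONTHS = AGE * 12;   /* manipulate: derive a new variable */
  DATALINES;
1 1 25
2 2 33
4 1 55
;
RUN;

/* PROC step: process the data set and display a report */
PROC PRINT DATA=PEOPLE;
  TITLE 'Listing of PEOPLE';
RUN;
```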
DATA steps: Getting data into SAS
There are three ways of getting data into a SAS
data set.
1. Including the data in the SAS command stream
– The data are like a card deck placed into
the stream of SAS commands.
– Use an INPUT statement to list the
variables and a CARDS statement (DATALINES
in current SAS) right before the data to
be read in. A line containing only a
semicolon marks the end of the data.
– Example:
DATA CARDSIN;
INPUT IDNUM SEX AGE;
CARDS;
1 1 25
2 2 33
4 1 55
;
RUN;
DATA steps: Getting data into SAS
2. Read the data in from a disk file.
– Use the INFILE statement to name the
disk file that holds the data.
– Then use the INPUT statement to list the
variables.
– Example:
DATA DISKIN;
INFILE 'RAWDATA.DAT';
INPUT IDNUM SEX AGE;
RUN;
DATA steps: Getting data into SAS
3. Create a new data set from an existing SAS
data set.
– Here, the SET statement is used to name
the existing SAS data set.
– Example: creates two new SAS data sets
from an existing SAS data set:
DATA FATHERS MOTHERS;
SET DISKIN;
IF SEX=1 THEN OUTPUT FATHERS;
ELSE OUTPUT MOTHERS;
RUN;
PROC steps: Data Management
• PROC SORT
Sorts a data set by one or more variables.
PROC SORT; BY ID; RUN; will sort the most
recently created data set by the values of
the variable ID.
• PROC CONTENTS
Displays the contents of the data set.
• PROC DATASETS
Manages SAS data set libraries.
• PROC RANK
Rank orders one or more variables.
• PROC STANDARDIZE
Rescales variables to a specified mean and/or
standard deviation.
PROC steps: Data Management
• PROC SCORE
Generates linear scores for certain procedures
like factor analysis and discriminant analysis.
• PROC TRANSPOSE
Transposes a data set.
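A short sketch of a few of these data management procedures, assuming a made-up data set PEOPLE and hypothetical output names PEOPLE_SORTED, PEOPLE_RANKED, and AGE_RANK. The OUT= option writes the result to a new data set and leaves the input untouched:

```sas
/* PEOPLE and all output names are hypothetical */
DATA PEOPLE;
  INPUT IDNUM SEX AGE;
  DATALINES;
4 1 55
1 1 25
2 2 33
;
RUN;

/* Sort by AGE into a new data set */
PROC SORT DATA=PEOPLE OUT=PEOPLE_SORTED;
  BY AGE;
RUN;

/* List the variables and attributes of the sorted data set */
PROC CONTENTS DATA=PEOPLE_SORTED;
RUN;

/* Add AGE_RANK, the rank of each observation's AGE */
PROC RANK DATA=PEOPLE OUT=PEOPLE_RANKED;
  VAR AGE;
  RANKS AGE_RANK;
RUN;
```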
PROC steps: Descriptive Statistics
• PROC FREQ
Simple frequencies and contingency tables
for categorical variables.
• PROC MEANS
Number of observations, mean, standard
deviation, and minimum and maximum
values for continuous variables.
• PROC UNIVARIATE
More detailed descriptive statistics for
continuous variables.
• PROC TABULATE
Produces tables of frequencies and/or
descriptive statistics.
PROC steps: Descriptive Statistics
• PROC SUMMARY
Descriptive statistics broken down by groups;
particularly useful for generating a data set
of descriptive statistics for input into other
procedures.
• PROC CORR
Parametric and nonparametric correlations.
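A sketch of the descriptive procedures on a small made-up data set (PEOPLE, WEIGHT, and all values are hypothetical and only illustrate the calls):

```sas
/* Hypothetical data: SEX is categorical, AGE and WEIGHT are continuous */
DATA PEOPLE;
  INPUT IDNUM SEX AGE WEIGHT;
  DATALINES;
1 1 25 70
2 2 33 62
3 2 41 58
4 1 55 81
;
RUN;

/* Frequency table for a categorical variable */
PROC FREQ DATA=PEOPLE;
  TABLES SEX;
RUN;

/* Basic summary statistics for continuous variables */
PROC MEANS DATA=PEOPLE N MEAN STD MIN MAX;
  VAR AGE WEIGHT;
RUN;

/* More detailed statistics for one variable */
PROC UNIVARIATE DATA=PEOPLE;
  VAR AGE;
RUN;

/* Parametric (Pearson) and nonparametric (Spearman) correlations */
PROC CORR DATA=PEOPLE PEARSON SPEARMAN;
  VAR AGE WEIGHT;
RUN;
```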
PROC steps: Regression
• PROC REG
General purpose linear regression and
multivariate regression.
• PROC GLM
General linear models, including regression,
analysis of variance/covariance, and
multivariate analysis of variance/covariance.
• PROC RSQUARE
All possible subsets regression.
• PROC RSREG
Quadratic response surface regression.
• PROC LOGISTIC
Logistic regression.
• PROC PROBIT
Probit regression.
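A minimal sketch of a simple linear regression, assuming a made-up data set FIT with a response Y and one predictor X (names and values are hypothetical). The same model is fitted with PROC REG and with PROC GLM:

```sas
/* Hypothetical data for illustration */
DATA FIT;
  INPUT X Y;
  DATALINES;
1 2.1
2 3.9
3 6.2
4 7.8
5 10.1
;
RUN;

/* Linear regression of Y on X */
PROC REG DATA=FIT;
  MODEL Y = X;
RUN;
QUIT;

/* The same model as a general linear model */
PROC GLM DATA=FIT;
  MODEL Y = X;
RUN;
QUIT;
```

PROC REG and PROC GLM are interactive procedures, so the QUIT statement is used here to end each one.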
PROC steps: ANOVA, Graphics
• Analysis of Variance
– PROC ANOVA
Analysis of variance for orthogonal data.
– PROC GLM
General linear models, including
regression, analysis of variance, and
multivariate analysis of variance.
– PROC NESTED
Nested analysis of variance.
– PROC VARCOMP
Variance components.
• Low Resolution Graphics
– PROC CHART
Pie, bar, and star charts.
– PROC PLOT
Two dimensional plots.
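A sketch of a one-way analysis of variance followed by a low-resolution chart, assuming a made-up balanced data set YIELDS with three treatments and three observations each (all names and values are hypothetical):

```sas
/* Hypothetical balanced data: 3 treatments x 3 observations */
DATA YIELDS;
  INPUT TREATMENT $ YIELD;
  DATALINES;
A 20
A 22
A 19
B 25
B 27
B 26
C 30
C 29
C 31
;
RUN;

/* One-way analysis of variance */
PROC ANOVA DATA=YIELDS;
  CLASS TREATMENT;
  MODEL YIELD = TREATMENT;
RUN;
QUIT;

/* Vertical bar chart of the mean yield for each treatment */
PROC CHART DATA=YIELDS;
  VBAR TREATMENT / SUMVAR=YIELD TYPE=MEAN;
RUN;
```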
PROC steps: Multivariate Analysis
• Discriminant Analysis
– PROC DISCRIM
General purpose parametric and
nonparametric discriminant analysis.
– PROC CANDISC
Canonical discriminant analysis.
• Principal Components and Factor Analysis
– PROC PRINCOMP
Principal components.
– PROC FACTOR
Factor analysis.
PROC steps: Multivariate Analysis
• Cluster Analysis
– PROC CLUSTER
Clustering observations.
– PROC FASTCLUS
Disjoint clustering for large data sets.
– PROC VARCLUS
Clustering variables.
• Survival Analysis
– PROC LIFETEST
Nonparametric survival estimates and life tables.
– PROC LIFEREG
Parametric survival analysis.
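A sketch of two of the multivariate procedures, assuming a made-up data set MEASURES with three numeric variables X1 to X3 (the names, values, and the choice of two clusters are all hypothetical):

```sas
/* Hypothetical data: two visibly separated groups of observations */
DATA MEASURES;
  INPUT X1 X2 X3;
  DATALINES;
1.0 2.1 0.9
1.2 1.9 1.1
0.8 2.0 1.0
5.1 6.2 4.8
4.9 5.8 5.2
5.3 6.0 5.0
;
RUN;

/* Principal components; component scores are saved in PCSCORES */
PROC PRINCOMP DATA=MEASURES OUT=PCSCORES;
  VAR X1 X2 X3;
RUN;

/* Disjoint clustering into at most two clusters;
   cluster assignments are saved in CLUSTERS */
PROC FASTCLUS DATA=MEASURES MAXCLUSTERS=2 OUT=CLUSTERS;
  VAR X1 X2 X3;
RUN;
```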