Lies, damn lies and ... SAS to the rescue!
Peter L. Flom
Peter Flom Consulting
SESUG, September 2015
Broad outline
1 Introduction
2 Descriptive statistics
3 Descriptive graphics
4 Inferential statistics
5 The regression family
6 Multivariate statistics
Part I
Introduction
Schedule
8:00 Descriptive statistics
8:40 Break
8:50 Descriptive graphics
9:30 Break
9:40 Inferential statistics
10:10 Break
10:20 The regression family
11:00 Break
11:10 Multivariate statistics
12:00 End
Introductions of participants and self
What I plan to do in this course
- Give you a fundamental understanding of some basic statistical methods
- Give you a very brief survey of many more advanced methods
- Help you learn to work with statistics and statisticians
- Give some SAS code you can give to others
- Note that the most important stuff is at the beginning, so ask questions!
What I don’t plan to do in this course
- Teach you to be a statistician
- Teach you SAS
What I want from you
- Attention
- Questions - after all, it’s not even graded
- Feedback - even anonymous is OK, after the course
Part II
Descriptive statistics
Introduction
Outline
1 Introduction
2 Measures of central tendency
3 Measures of spread
4 Other measures
Introduction
Descriptive vs. inferential stats
- Descriptive statistics describe a variable or a sample.
- Inferential statistics let you infer from a sample to a population (more later).
- Descriptive statistics are necessary even when your goal is inference.
Introduction
Types of descriptive statistics
For continuous variables, descriptive statistics include:

- Measures of central tendency
- Measures of dispersion or spread
- Measures of skewness
- Measures of kurtosis
- Other measures
For categorical variables, mostly we are limited to frequencies.
Measures of central tendency
The mean
What it is
Definition
The mean is the ordinary average. Add up the numbers and divide by the number of numbers.

Or, if you want a formula

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
where x is the variable and there are n values of the variable.
Measures of central tendency
The mean
What can go wrong
- Outliers
- Skewness
- The clock problem
- The rate problem
- Different scales
Measures of central tendency
The mean
Mean salary
    proc means data = sashelp.baseball maxdec = 2;
        var salary;
    run;

Analysis Variable: Salary (1987 Salary in $ Thousands)

      N     Mean   Std Dev   Minimum   Maximum
    263   535.93    451.12     67.50   2460.00
Measures of central tendency
The mean
Alternatives
1 The median
2 The trimmed mean and Winsorized mean
3 The geometric mean
4 The harmonic mean
which are the topics of the next few slides
Measures of central tendency
The median
Median salary
The median is simply the value that divides the distribution in half - half are lower, half are higher.

    ods select BasicMeasures;
    proc univariate data = sashelp.baseball;
        var salary;
    run;

Basic Statistical Measures

    Location               Variability
    Mean     535.9259      Std Deviation         451.11868
    Median   425.0000      Variance                 203508
    Mode     750.0000      Range                      2393
                           Interquartile Range   560.00000
Measures of central tendency
The median
What can go wrong
- Sometimes we want the outliers
- When there are many ties, the median may not be completely determined.
Measures of central tendency
The trimmed mean and Winsorized mean
What it is
- A compromise between the mean and the median.
- To calculate the trimmed mean, you remove a certain percentage of the highest and lowest points and then find the mean of what remains.
- The Winsorized mean is similar but, rather than deleting the points, you set them equal to the lowest or highest values that are not extreme.
Measures of central tendency
The trimmed mean and Winsorized mean
What can go wrong
If the distribution is skewed, the trimmed mean is not an unbiased estimator for either the mean or median.
Measures of central tendency
The trimmed mean and Winsorized mean
Trimmed and winsorized mean salary
    ods select TrimmedMeans WinsorizedMeans;
    proc univariate data = sashelp.baseball
        trimmed = .1 winsorized = .1;
        var salary;
    run;

    % per tail   N per tail   Trimmed mean      SE   Winsorized mean      SE
         10.27           27         463.89   25.08            486.04   25.09
Measures of central tendency
The geometric mean
What it is
Definition
It’s like the mean, except that instead of adding the numbers and then dividing by the count, you multiply the numbers and take the nth root of the product.

Or, if you want a formula

\[ \left( \prod_{i=1}^{n} x_i \right)^{1/n} \]
Measures of central tendency
The geometric mean
What can go wrong
Doesn’t work when any value is 0 or negative
Measures of central tendency
The geometric mean
When to use it
- Useful for combining measures on different scales, e.g. candidates for college - combine SAT (0 to 1600) and HS GPA (0 to 4)
- Proportional growth over a series of times
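As a hedged illustration of the proportional growth case (the growth figures are made up, not taken from the course data): suppose a quantity grows by 10%, then by 20%, then shrinks by 10%. The average growth factor per period is the geometric mean of the three factors:

```latex
\[
\left( 1.10 \times 1.20 \times 0.90 \right)^{1/3} = 1.188^{1/3} \approx 1.059
\]
```

That is roughly 5.9% growth per period. The arithmetic mean of the same factors, about 1.067, would overstate it.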
Measures of central tendency
The geometric mean
Geometric mean of college applicants
    data college;
        input name $ GPA SAT @@;
        datalines;
    Jill 3.0 1550 Joe 4.0 1500
    ;
    data college;
        set college;
        gmean = geomean(GPA, SAT);
        amean = mean(GPA, SAT);
    run;
    proc print data = college; run;

    Obs   name   GPA    SAT     gmean   amean
      1   Jill     3   1550   68.1909   776.5
      2   Joe      4   1500   77.4597   752.0
Measures of central tendency
The harmonic mean
Harmonic mean of round trip travel
Definition
It is the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.

\[ H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} \]
Measures of central tendency
The harmonic mean
When to use it
- Averaging rates, such as speeds or batting averages
- Averaging ratios, such as price-earnings ratios
Measures of central tendency
The harmonic mean
What can go wrong
Like the geometric mean, it doesn’t work with negative numbers or 0’s.
Measures of central tendency
The harmonic mean
SAS code
    data speed;
        input To From @@;
        datalines;
    50 80 40 70
    ;
    data speed;
        set speed;
        hmean = harmean(to, from);
        amean = mean(to, from);
        time = 100 / to + 100 / from;
        actualspeed = 200 / time;
    run;
    proc print data = speed; run;

    Obs   To   From   hmean   amean   time   actualspeed
      1   50     80   61.54      65   3.25         61.54
      2   40     70   50.91      55   3.93         50.91
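A short derivation shows why hmean and actualspeed agree in the output above. The symbols d (one-way distance) and v_1, v_2 (the two speeds) are introduced here only for illustration:

```latex
\[
\text{average speed}
= \frac{\text{total distance}}{\text{total time}}
= \frac{2d}{\dfrac{d}{v_1} + \dfrac{d}{v_2}}
= \frac{2}{\dfrac{1}{v_1} + \dfrac{1}{v_2}}
\]
\[
v_1 = 50,\ v_2 = 80: \quad \frac{2}{0.0200 + 0.0125} = \frac{2}{0.0325} \approx 61.54
\]
```

The distance cancels, so the result does not depend on the 100 used in the code.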
Measures of central tendency
Exercises
Exercises
1 Name 3 variables for which the mean would not be appropriate
2 For each of those, decide which measure of central tendency would be appropriate, and why
Measures of spread
Standard deviation
What it is
Definition
The standard deviation is the square root of the average squared difference between the mean and the individual values (for a sample, the sum is divided by n − 1 rather than n).

Or

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
Measures of spread
Standard deviation
What can go wrong
If the mean isn’t a good measure of central tendency, the SD isn’t a good measure of spread.
Measures of spread
Standard deviation
SD of salary
    proc means data = sashelp.baseball;
        var salary;
    run;

Basic Statistical Measures

    Location               Variability
    Mean     535.9259      Std Deviation         451.11868
    Median   425.0000      Variance                 203508
    Mode     750.0000      Range                      2393
                           Interquartile Range   560.00000
Measures of spread
Standard deviation
Alternatives
- Median absolute deviation (MAD)
- Range and interquartile range
- More quantiles
- Gini’s mean difference
- Variations on MAD
- (also see graphics, later)
Measures of spread
MAD
What it is
Definition
The median absolute deviation is what it says:

1 Find the median
2 Find each value’s deviation from the median
3 Take absolute values
4 Find the median of those
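The four steps can be followed by hand on a small example. The values 1, 2, 3, 4, 100 are made up for illustration and are not from the salary data:

```latex
\begin{align*}
\text{median} &= 3 \\
\text{deviations from the median} &= -2,\ -1,\ 0,\ 1,\ 97 \\
\text{absolute deviations} &= 2,\ 1,\ 0,\ 1,\ 97 \\
\text{MAD} &= \operatorname{median}(0,\ 1,\ 1,\ 2,\ 97) = 1
\end{align*}
```

The outlier 100 has no effect on the MAD here, whereas the standard deviation of the same five values is about 43.6.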
Measures of spread
MAD
What can go wrong?
- Not very efficient
- Not appropriate with asymmetric distributions
Measures of spread
Range and interquartile range
What it is
- The range is the distance from the smallest value to the largest value
- The IQR is the distance from the 1st quartile to the 3rd quartile
Measures of spread
Range and interquartile range
What can go wrong
- The range is strongly affected by even a single outlier
- The IQR is not affected at all by outliers
Measures of spread
Range and interquartile range
SAS code
    ods select RobustScale;
    proc univariate data = sashelp.baseball RobustScale;
        var salary;
    run;

Robust Measures of Scale

    Measure                     Value   Estimate of Sigma
    Interquartile Range      560.0000            415.1285
    Gini’s Mean Difference   468.0400            414.7897
    MAD                      275.0000            407.7150
    Sn                       381.6320            382.9424
    Qn                       327.7303            325.9949
Measures of spread
Range and interquartile range
Exercises
List 3 variables that would not be well analyzed by the SD and suggest alternatives.
Other measures
Skewness
What it is
Definition
Skewness is the asymmetry of the distribution.

\[ \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}} \]

Skewness can take on any number: negative means left skew, positive means right skew, 0 means symmetrical.
Other measures
Skewness
Alternatives and problems
One good way to look at skewness is with density plots (to be covered later).

What can go wrong

- A single outlier can generate skewness.
- Again, if the mean is not an appropriate measure of central tendency, this is not an appropriate measure of skew.
Other measures
Skewness
Skewness of salary
    ods select Moments;
    proc univariate data = sashelp.baseball;
        var salary;
    run;

Moments

    N                    263      Sum Weights               263
    Mean              535.93      Sum Observations   140948.507
    Std Deviation     451.12      Variance           203508.064
    Skewness            1.59      Kurtosis           3.05896473
    Coeff Variation    84.18      Std Error Mean          27.82
Other measures
Kurtosis
What it is
It is usually described as a measure of the peakedness of the distribution, although it is driven mainly by the tails. It is very nonintuitive and hard to interpret. It can be used to indicate a non-normal distribution, but its use beyond that is tricky (and confuses even experienced people).
Better to use graphical measures such as density plots (to be covered later).
Other measures
Exercises and further reading
Exercises
List 3 variables that are markedly skewed, either to the right or left
Other measures
Exercises and further reading
Discussion
Other measures
Exercises and further reading
Further reading
www.statisticalanalysisconsulting.com/how-to-go-wrong-with-the-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-harmonic-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-trimmed-mean-and-median/
www.statisticalanalysisconsulting.com/statistical-measures-of-spread/
Exploratory Data Analysis by John Tukey
Part III
Descriptive graphics
Introduction
Outline
5 Introduction
6 Univariate graphics
7 Bivariate graphics
8 Trivariate and multivariate graphics
9 Time series data
10 Exercises and further reading
Introduction
General thoughts on statistical graphics - 1
A good graph will

- Show the data
- Induce the viewer to think about the substance of the data
- Avoid distorting the data
- Present many numbers in a small space
Introduction
General thoughts on statistical graphics - 2
- Make large data sets coherent
- Encourage the eye to look at different parts of the data
- Reveal several levels of detail
- Serve a clear purpose
Introduction
General thoughts on statistical graphics - 3
But a good graphic will not

- Be a substitute for a table
- Be a substitute for a model
Introduction
General thoughts on statistical graphics - 4
- Use of color, shape and so on
- Consider the audience
- Not all chart junk is bad
Univariate graphics
Univariate discrete data
Introduction
This is usually counts or proportions of something, e.g. number of Democrats, Republicans and others. Here:

- Pie charts should be avoided
- Dot charts are often good
- A table may be even better
- Log scales are sometimes helpful
Univariate graphics
Univariate discrete data
Pie chart with 51 categories - a mess
[Figure: pie chart with one slice per Geographical_Area (the states, the District of Columbia and Puerto Rico).]
Univariate graphics
Univariate discrete data
Dot chart with 51 categories
[Figure: dot chart with one row per state, ordered from California at the top to Wyoming at the bottom. x-axis: Population (millions, log scale); y-axis: State; points marked by region (Midwest, Northeast, South, West).]
Univariate graphics
Univariate discrete data
Pie chart with 9 categories - a table or dot chart
    East North Central Division    46395654
    East South Central Division    18084651
    Middle Atlantic Division       40621237
    Mountain Division              21784507
    New England Division           14303542
    Pacific Division               49070441
    South Atlantic Division        58398377
    West North Central Division    20165794
    West South Central Division    35235521
Univariate graphics
Univariate discrete data
Pie chart with 4 categories - a table or text
[Figure: pie chart with one slice per Geographical_Area (Midwest Region, Northeast Region, South Region, West Region).]
Univariate graphics
Univariate discrete data
SAS code
The SAS code for the pie charts isn’t shown because you shouldn’t use it. The code for the dot plot is complex; I can e-mail it to you if you want.
Univariate graphics
Univariate continuous data
Introduction
- Histograms can be misleading, at least if they are unadorned
- Density plots are often better, and several smooths can be used.
- Box plots provide a useful summary
- When N is small, strip charts can be useful
Univariate graphics
Univariate continuous data
Density plot - example
[Figure: Density plot, salaries. x-axis: 1987 Salary in $ Thousands; y-axis: Density; curves: Kernel, Kernel (c=0.5), Kernel (c=2).]
Univariate graphics
Univariate continuous data
Density plot - SAS code
    proc sgplot data = sashelp.baseball;
        density salary / type = kernel;
        density salary / type = kernel(c = .5)
            curvelabelattrs = (color = red);
        density salary / type = kernel(c = 2)
            curvelabelattrs = (color = green);
        xaxis min = 0;
    run;
Univariate graphics
Univariate continuous data
Box plot - example
[Figure: box plot of salary. y-axis: 1987 Salary in $ Thousands.]

    proc sgplot data = sashelp.baseball;
        vbox salary;
    run;
Univariate graphics
Univariate continuous data
Box plot - example, log scale
[Figure: box plot of salary on a log scale. y-axis: 1987 Salary in $ Thousands.]

    proc sgplot data = sashelp.baseball;
        vbox salary;
        yaxis type = log logbase = 10 logstyle = linear;
    run;
Univariate graphics
Univariate continuous data
Strip plot - example
[Figure: Salary strip plot. x-axis: jitter; y-axis: 1987 Salary in $ Thousands.]
Univariate graphics
Univariate continuous data
Strip plot - SAS code
    data strip;
        set sashelp.baseball;
        jitter = 1 * (ranuni(1234) / 5) + .9;
    run;
    title "Salary strip plot";
    proc sgplot data = strip;
        scatter x = jitter y = salary;
        xaxis min = 0 max = 2 display = none;
    run;
Bivariate graphics
Both categorical
Mosaic plots
A little known and under-used plot is the mosaic plot. It is a way of visualizing a crosstabulation. For example, sex and party ID.

    ods select MosaicPlot;
    proc freq data = mosaic;
        table party*sex / plots = mosaic;
        weight count;
    run;
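The PROC FREQ call for the mosaic plot assumes a data set named mosaic with one row per cell of the crosstabulation and a count variable. The slides do not show how it was built; a minimal sketch, with made-up counts used purely for illustration, might look like this:

```sas
/* Hypothetical cell counts: illustration only, not real survey data */
data mosaic;
    length party $ 10 sex $ 6;   /* avoid truncating "Republican" to 8 characters */
    input party $ sex $ count;
    datalines;
Democrat   Female 60
Democrat   Male   40
Republican Female 45
Republican Male   55
Other      Female 10
Other      Male   12
;
run;
```

With the data in this shape, the WEIGHT statement tells PROC FREQ to treat count as the number of observations in each cell.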
Bivariate graphics
One categorical
Introduction
When N is relatively small, a strip chart is good - it shows all the data. When N is larger, a parallel boxplot shows a lot of the key information.
Bivariate graphics
One categorical
Parallel boxplot - example
[Figure: Salary by division. Parallel box plots; x-axis: League and Division (AE, AW, NE, NW); y-axis: 1987 Salary in $ Thousands.]

    title "Salary by division";
    proc sgplot data = sashelp.baseball;
        vbox salary / category = div;
    run;
Bivariate graphics
One categorical
Strip chart - example
[Figure: Salary by division, rookies. Strip chart; x-axis: 1987 Salary in $ Thousands; y-axis: League and Division (AE, NE, AW, NW).]

    title "Salary by division, rookies";
    proc sgplot data = sashelp.baseball;
        scatter x = salary y = div;
        where yrmajor le 1;
    run;
Bivariate graphics
Neither categorical
The scatter plot
The most common (and one of the best) basic options here is the scatter plot. But there are variations.
Bivariate graphics
Neither categorical
Scatter plot - basic example
[Figure: scatter plot. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands.]

    proc sgplot data = sashelp.baseball;
        scatter x = CrHits y = Salary;
    run;
Bivariate graphics
Neither categorical
Scatter plot - log scale
[Figure: scatter plot with both axes on a log scale. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands.]

    proc sgplot data = sashelp.baseball;
        scatter x = CrHits y = Salary;
        xaxis type = log logstyle = linear;
        yaxis type = log logstyle = linear;
    run;
Bivariate graphics
Neither categorical
Scatter plot - log scale plus loess
[Figure: scatter plot with both axes on a log scale and a loess curve. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands.]
Bivariate graphics
Neither categorical
Scatter, log scale with loess
    proc sgplot data = sashelp.baseball;
        xaxis label = "Career hits (log scale)"
            type = log logstyle = linear;
        yaxis label = "Salary in thousands of $ (log scale)"
            type = log logstyle = linear;
        scatter x = CrHits y = salary;
        loess x = CrHits y = Salary / nomarkers;
        ellipse x = CrHits y = Salary;
    run;
Bivariate graphics
Neither categorical
Scatter plot - A fancy example
[Figure: Scatter plot with density plots. States plotted by two-letter abbreviation, with a prediction ellipse (α=.05). x-axis: Unemployment (%); y-axis: Infant Mortality (per XXX); a marginal density plot along each axis.]
Bivariate graphics
Neither categorical
Scatter plot - Another fancy example
[Figure: Box plot w/barchart. Box plots of Weight for each value of ht (55 to 79), with a bar chart of N for each value of ht.]
Trivariate and multivariate graphics
All continuous
The scatterplot matrix - example
[Figure: scatterplot matrix of Career Hits, Career Home Runs and 1987 Salary in $ Thousands.]

    proc sgscatter data = sashelp.baseball;
        matrix CrHits CrHome Salary;
    run;
Trivariate and multivariate graphics
All continuous
The scatterplot matrix - a more complex example
    proc sgscatter data = sashelp.baseball;
        matrix CrHits CrHome CrBB CrRbi /
            markerattrs = (symbol = circlefilled size = 8)
            diagonal = (kernel)
            ellipse
            colorresponse = salary;
    run;
Trivariate and multivariate graphics
All continuous
Bubble plot
[Figure: Bubble plot of rookie salaries. x-axis: Career Hits; y-axis: Career Home Runs; bubble size: salary.]

    title "Bubble plot of rookie salaries";
    proc sgplot data = sashelp.baseball;
        bubble x = CrHits y = CrHome size = salary;
        where yrmajor le 1;
    run;
Trivariate and multivariate graphics
Some continuous
Coplot
[Figure: coplot. Scatter plots of Career Home Runs (y-axis) against Career Hits (x-axis), one panel per League and Division (AE, AW, NE, NW).]

    proc sgpanel data = sashelp.baseball;
        panelby div;
        scatter x = CrHits y = CrHome;
    run;
Trivariate and multivariate graphics
Some continuous
Scatter plot matrix with group variable - example
[Figure: Several statistics by league and division - 5 years or less. Scatterplot matrix of Career Hits, Career Home Runs, 1987 Salary in $ Thousands and Career Times at Bat, grouped by League and Division (AE, AW, NE, NW).]

    proc sgscatter data = sashelp.baseball;
        title "Several statistics by league and division - 5 years or less";
        matrix CrHits CrHome Salary CrAtBat /
            group = div diagonal = (kernel);
        where yrmajor le 5;
    run;
Trivariate and multivariate graphics
Some continuous
Scatter plot matrix - another example
    proc sgscatter data = sashelp.baseball;
        plot (salary) * (nHits nHome nBB nAssts)
            / markerattrs = (symbol = circlefilled size = 8)
            loess
            colorresponse = yrmajor
            colormodel = twocolorramp;
    run;
Trivariate and multivariate graphics
None continuous
Introduction
When all variables are categorical, generalizations of the mosaic plot can be used.
Time series data
Electrical workers over time
[Figure: Series Values for ELECTRIC (PROC TIMESERIES output). x-axis: DATE, Jan 1977 to Jul 1982; y-axis: electrical workers, thousands.]
Time series data
Electrical workers over time
[Figure: Seasonal Decomposition/Adjustment for ELECTRIC (PROC TIMESERIES output), 1977 to 1982. Panels: Trend-Cycle, Seasonal-Irregular, Irregular, Seasonally Adjusted.]
Time series data
Electrical workers over time - SAS code
    title "Timeseries decomposition";
    proc timeseries data = sashelp.workers out = _null_
        plots = (series decomp);
        id date interval = month;
        var electric;
    run;
Exercises and further reading
Describe a set of variables and say what graph you would use for it and why
Exercises and further reading
Discussion
Exercises and further reading
Further reading - blog links
Parallel box plots http://www.statisticalanalysisconsulting.com/graphics-for-bivariate-data-parallel-box-plots/
Pie is delicious but not nutritious http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-nutritious/
Scatterplots http://www.statisticalanalysisconsulting.com/scatterplots-and-enhancements/
Graphics: The good, the bad and the ugly http://www.statisticalanalysisconsulting.com/graphics-the-good-the-bad-and-the-ugly/
Exercises and further reading
Further reading - books
- Creating more effective graphs by Naomi Robbins
- Visualizing data by William S. Cleveland
- The elements of graphing data by William S. Cleveland
- A trout in the milk by Howard Wainer
Part IV
Inferential statistics
From sample to population
- A population is the entire set of all the subjects (people or whatever) that you want to study.
- A sample is a subset of that population.
- A random sample is a sample where all subjects have a definable chance of being selected.
Null and alternative hypotheses
- The null hypothesis is usually "nothing is going on"
- The alternative is "something is going on"
- Trial analogy
What is a p value?
Definition
If, in the population from which this sample was randomly drawn, the null was strictly true, what is the probability of getting a test statistic at least as extreme as the one we got, in a sample the size of the one we have?
In other words, if we do 1000 really silly things, what proportionwill come out significant?
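The "1000 really silly things" idea can be tried directly by simulation. The sketch below is an illustration, not part of the original course code: the data set names (sim, tt, flags), the seed and the group size of 15 are arbitrary choices. It generates 1000 two-group studies in which the null is true by construction, runs a t test on each, and reports the proportion with p < 0.05.

```sas
/* Simulate 1000 studies in which the null hypothesis is true */
data sim;
    call streaminit(2015);           /* arbitrary seed, for reproducibility */
    do study = 1 to 1000;
        do i = 1 to 30;
            group = mod(i, 2);       /* 15 observations per group */
            y = rand("normal");      /* same distribution in both groups */
            output;
        end;
    end;
run;

ods exclude all;                     /* suppress 1000 sets of printed output */
proc ttest data = sim;
    by study;
    class group;
    var y;
    ods output TTests = tt;          /* capture the test results as a data set */
run;
ods exclude none;

/* Flag studies that came out "significant" at the 0.05 level */
data flags;
    set tt;
    where Method = "Pooled";
    significant = (Probt < 0.05);
run;

proc means data = flags mean maxdec = 3;
    var significant;
run;
```

The mean of significant should come out close to 0.05: about 1 in 20 of these studies is "significant" even though nothing is going on in any of them.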
Experiments vs. observational studies
- In an experiment, subjects are randomly selected and then randomly assigned to a condition
- In an observational study, neither of these is true
- Some people use the term quasi-experiment when one of the above is true
Problems
- Not usually the question we want to ask
- Strongly affected by sample size
The Bayesian approach
Idea

- Set a prior - often a uniform prior
- Let data modify it.

Advantages

- More intuitive
- Lets you have a prior

Disadvantages

- Hard to set a prior
- Uninformed prior usually gives similar results to frequentist approach
- Still not the question we are interested in
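A standard textbook case shows what "set a prior, let data modify it" looks like in practice. This is an illustration with made-up numbers (7 successes in 10 trials) for estimating a proportion θ under a uniform prior:

```latex
\[
\theta \sim \mathrm{Beta}(1, 1) \quad \text{(uniform prior)}
\]
\[
k \text{ successes in } n \text{ trials} \;\Rightarrow\;
\theta \mid \text{data} \sim \mathrm{Beta}(1 + k,\ 1 + n - k)
\]
\[
k = 7,\ n = 10: \quad
E[\theta \mid \text{data}] = \frac{1 + k}{2 + n} = \frac{8}{12} \approx 0.667
\]
```

The posterior mean is close to the sample proportion of 0.70, which is the sense in which an uninformed prior usually gives results similar to the frequentist approach.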
What we want
- Effect sizes and measures of their accuracy
- Risk-reward analysis
Further reading
- The Insignificance of Statistical Significance Testing by Douglas Johnson
- The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey
Part V
The regression family
Introduction
Outline
11 Introduction
12 The OLS model
13 Other models for continuous DV
14 The logistic family
15 Count models
16 Multilevel models
17 Exercises and further reading
Introduction
What is regression?
Regression is a term for a variety of models relating dependent variables (usually just one) to one or more independent variables.
Introduction
Varieties of regression
The type of regression depends on the nature of the dependent variable and on the nature of the relationships.

- Continuous: OLS and alternatives (see below)
- Dichotomous: logistic
- Categorical (>2 levels): multinomial logistic
- Ordinal: ordinal logistic
- Count: Poisson, negative binomial and variations
- Time to event: survival models
The OLS model
What it is
Ordinary least squares is the most common regression model and it is what people mean when they say 'regression'.
The model is

\[ Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + e \]

where e is error and is normally distributed with 0 mean and constant variance.
The OLS model
What can go wrong
- Overfitting
- Nonlinear fits
- Nonnormal residuals
- Dependent data
- Collinearity
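As a rough sketch of how an OLS model and some of these checks might be run in SAS, the example below fits salary on the same predictors used elsewhere in these slides. The choice of predictors and of diagnostics is an assumption made for illustration, not the author's analysis.

```sas
/* OLS fit with residual diagnostics and variance inflation factors */
proc reg data = sashelp.baseball plots(only) = (diagnostics residuals);
    /* VIF helps flag collinearity among the predictors */
    model salary = YrMajor nAtBat nHits nHome nOuts / vif;
run;
quit;
```

The diagnostics panel helps with judging nonnormal residuals and nonlinear fits. Overfitting and dependent data need to be considered from the study design rather than from this output.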
Other models for continuous DV
Introduction
- Multivariate adaptive regression splines (MARS): PROC ADAPTIVEREG
- Quantile regression: PROC QUANTREG
- Transformations: PROC TRANSREG
More information: See my paper at SGF 2015.
Other models for continuous DV
MARS
Introduction
MARS models allow extremely flexible curves (called splines) to be fit to data.
MARS models are most useful

- In high dimensional spaces
- When there is little substantive reason to assume linearity or a low-level polynomial fit
Other models for continuous DV
MARS
Advantages and disadvantages of MARS models

Advantages of MARS models:

- Very flexible fitting of the relationship between independent and dependent variables
- Model selection methods that can sharply reduce the dimension of the model.
- SAS implementation of these models extends them to dependent variables in the exponential family.
- Can be more accurate than GLM, with greater parsimony

Disadvantages of MARS models:

- Hard to interpret
- Less familiar
Other models for continuous DV
MARS
Example
I modeled baseball salary as a function of various attributes of the players. ADAPTIVEREG got a significantly higher R² with considerably fewer terms. But the result is very hard to interpret.

    proc adaptivereg data = sashelp.baseball
        plots = all details = bases;
        class team;
        model salary = YrMajor nAtBat nHits nHome nOuts;
    run;
Other models for continuous DV
Quantile regression
Introduction
There are at least three motivations for quantile regression:

- DV is bimodal or multimodal
- Highly skewed DV
- Substantive interest in the quantiles

Advantages include:

- No assumptions about the distribution of the residuals
- More flexible hypotheses

Disadvantages include:

- Not as powerful as OLS regression when that is the appropriate model
- Not robust to high leverage points.
Other models for continuous DV
Quantile regression
Example
A quantile regression of baseball salary:

    proc quantreg data = sashelp.baseball plots = all;
        model salary = YrMajor nAtBat nHits nHome nOuts /
            quantile = 0.1 0.5 0.9;
    run;

revealed that the relationship between salary and various player attributes was different at different levels of salary, e.g.:

- Number of home runs was more important at high levels of salary.

But this should be viewed with caution because of high leverage points.
Other models for continuous DV
TRANSREG
Introduction
- Sometimes it makes sense to transform one or more variables.
- You can do this in a data step, but
  - PROC TRANSREG offers many options and allows automation of some tasks
  - Some transformations (e.g. splines) are hard or impossible in a data step
- TRANSREG is very flexible and allows optimal fitting.
Other models for continuous DV
TRANSREG
Example
A spline regression of baseball salary

    proc transreg data = sashelp.baseball plots = all;
        model identity(salary) = spline(YrMajor nAtBat
            nHits nHome nOuts);
    run;

showed non-monotonic relationships between salary and performance
The logistic family
Introduction
When the dependent variable is categorical (either dichotomous, nominal or ordinal) OLS regression is not recommended because

- The assumption of normal residuals is violated
- The predicted values can be ludicrous

The usual method for these cases is logistic regression (either 'normal', multinomial or ordinal). The key output is odds ratio estimates.
The logistic family
What are odds ratios?
In OLS regression the dependent variable is continuous. In logistic, it’s not. How do we go from a 0 - 1 response to a continuous one from −∞ to ∞?

- Find odds of something happening for each level of each IV, e.g. odds of men and women voting for Obama. That goes from 0 to ∞.
- Take ratio of the odds. That goes from 0 to ∞ as well.
- Take log of the ratio for modeling. That goes from −∞ to ∞.
- But the OR is easier to interpret.
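The steps from probabilities to log odds ratio can be traced with made-up numbers. Suppose, purely for illustration, that 60% of women and 50% of men vote for a candidate:

```latex
\begin{align*}
\text{odds}_{\text{women}} &= \frac{0.60}{1 - 0.60} = 1.5 \\
\text{odds}_{\text{men}} &= \frac{0.50}{1 - 0.50} = 1.0 \\
\text{OR} &= \frac{1.5}{1.0} = 1.5 \\
\log(\text{OR}) &= \ln(1.5) \approx 0.405
\end{align*}
```

The model works on the 0.405 scale, but "the odds for women are 1.5 times the odds for men" is the statement most readers can use.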
The logistic family
Logistic regression - examples
- Predict or explain purchase of a product vs. no purchase - dichotomous
- Predict or explain position on a team - multinomial
- Predict or explain likelihood of returning - ordinal
The logistic family
What can go wrong
Coding 0 and 1 incorrectly - be careful which response SAS is modelling
Effect coding. For categorical IVs, SAS defaults to effect coding, but reference coding is often better
Quasi-complete and complete separation - slicing the pie too thin
Concordant and discordant in output don't mean what they seem to
Need to use SLICE to get interaction odds ratios
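The first two pitfalls can be avoided by stating the choices explicitly. The sketch below is not from the course materials; it assumes the sashelp.heart sample data set is installed, with variables Status ('Alive' / 'Dead'), Sex, AgeAtStart and Cholesterol.

```sas
/* Assumes sashelp.heart is available; the variable choice is illustrative only */
proc logistic data = sashelp.heart;
  /* reference coding instead of the default effect coding */
  class Sex (ref = 'Female') / param = ref;
  /* say explicitly which response level is being modelled */
  model Status (event = 'Dead') = Sex AgeAtStart Cholesterol;
run;
```

Checking the Response Profile table in the output is a quick way to confirm which level SAS actually modelled.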
The logistic family
Ordinal and multinomial logistic example
When the DV has multiple categories, they can be ordinal or nominal. If ordinal, use PROC LOGISTIC and the LINK = clogit. If nominal, LINK = glogit. Interpretation can be tricky, but is basically a generalization of the dichotomous case.
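A minimal sketch of both cases, again assuming the sashelp.heart sample data set with variables BP_Status, DeathCause, AgeAtStart, Weight and Cholesterol (these choices are illustrative, not from the course). For the ordinal model the link option is left out, because PROC LOGISTIC fits a cumulative logit model by default when the response has more than two levels.

```sas
/* Ordinal DV: cumulative logit is the default with more than two levels. */
/* Levels are ordered by formatted value unless told otherwise, so check  */
/* the Response Profile table.                                            */
proc logistic data = sashelp.heart;
  model BP_Status = AgeAtStart Weight;
run;

/* Nominal DV: generalized logit */
proc logistic data = sashelp.heart;
  model DeathCause = AgeAtStart Cholesterol / link = glogit;
run;
```

The generalized logit model produces a separate set of coefficients for each non-reference response level, which is one reason interpretation gets tricky.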
Count models
Introduction
When the DV is a count (a non-negative integer) and especially when the counts aren't very large, OLS is not recommended. Count models such as Poisson or negative binomial regression should be used. PROC GENMOD is used for these analyses.
Count models
Examples
How many cell phones does a person own?
How many divorces will a person go through?
Count models
What can go wrong?
Overdispersion
Failure to fit
Abundance of 0's - use ZIP or ZINB models
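As a rough illustration of these options, the sketch below treats home runs in sashelp.baseball as the count DV. The choice of predictors is an assumption made for the example, not an analysis from the course.

```sas
/* Poisson regression; a Pearson Chi-Square Value/DF well above 1 */
/* in the fit statistics suggests overdispersion                  */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = poisson link = log;
run;

/* Negative binomial regression, one common response to overdispersion */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = negbin link = log;
run;

/* Zero-inflated Poisson; ZEROMODEL names predictors of the excess zeros */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = zip;
  zeromodel nAtBat;
run;
```

Comparing the fit statistics across the three runs gives a first indication of which distribution suits the data.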
Multilevel models
Introduction
All the regression models above assume independent errors. When this is violated, things can go very wrong. Multilevel models (MLM) are one way to deal with this.
Multilevel models
Examples
Repeated measurements of the same thing on the same people
Measurements on people who are clustered
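For the clustered case, one possible sketch is a random-intercept model in PROC MIXED. It treats players in sashelp.baseball as clustered within teams; the model itself is an assumption made for illustration and was not fitted in the course.

```sas
/* Random intercept for each team: players on the same team */
/* are allowed to have correlated errors                    */
proc mixed data = sashelp.baseball;
  class Team;
  model logSalary = nHits YrMajor / solution;
  random intercept / subject = Team;
run;
```

The covariance parameter estimates show how much of the variation in log salary lies between teams rather than between players.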
Exercises and further reading
Exercises
From your experience, list several regression problems and propose a regression method for each
Exercises and further reading
Discussion
Exercises and further reading
Further reading - blog links
Simple linear regression: http://www.statisticalanalysisconsulting.com/what-is-simple-linear-regression/
Multiple linear regression: http://www.statisticalanalysisconsulting.com/what-is-multiple-linear-regression/
Survival analysis: http://www.statisticalanalysisconsulting.com/what-is-survival-analysis/
Alternative methods of regression when OLS is not right: http://support.sas.com/resources/papers/proceedings15/3412-2015.pdf
Exercises and further reading
Further reading - books
Regression Analysis by Example by Samprit Chatterjee and Ali Hadi
Regression Models for Categorical and Limited Dependent Variables by J. Scott Long
Categorical Data Analysis by Alan Agresti
Part VI
Multivariate statistics
Introduction
Sometimes there is no dependent variable, but you want to be able to figure out what is going on in a huge mass of data.
Exploratory factor analysis
Introduction
Factor analysis is a method of finding latent factors in multivariate data. Latent variables are those that can't be directly measured. Examples:
  Personality scales
  IQ
  Views on complex issues
Exploratory factor analysis
Steps involved
Extracting factors - several methods
Rotation - many methods, in two groups
  Orthogonal - each factor is uncorrelated with others, easier to interpret but may not be realistic
  Oblique - factors can be correlated
Interpretation - EFA is not determinate, much will depend on interpretation
Exploratory factor analysis
Example
Factor analysis of current statistics showed 2 factors:
proc factor data = sashelp.baseball r = varimax;
  var nassts nAtBat -- nBB nouts;
run;

Rotated Factor Pattern
Variable  Label                 Factor1    Factor2
nAtBat    Times at Bat in 1986  0.88078    0.37098
nHits     Hits in 1986          0.87357    0.33843
nHome     Home Runs in 1986     0.81700   -0.19594
nRuns     Runs in 1986          0.91078    0.21618
nRBI      RBIs in 1986          0.92417    0.04853
nBB       Walks in 1986         0.74709    0.09339
nAssts    Assists in 1986       0.03736    0.92947
nOuts     Put Outs in 1986      0.45303   -0.03541
nError    Errors in 1986        0.10152    0.87866
Exploratory factor analysis
What can go wrong
GIGO can appear like GIPO - garbage in, pearls out
No simple structure
Unclear number of factors
Principal component analysis (PCA)
Introduction
PCA is a dimension reduction method; use it when you have a large number of variables that you want to reduce with minimal loss of information.
Principal component analysis (PCA)
What can go wrong
Components may not make sense
Components may not be useful for further analysis
If doing regression, consider partial least squares.
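A minimal PCA sketch on the same baseball variables used in the other multivariate examples. Keeping three components and the output data set name bb_pcs are arbitrary assumptions made for illustration.

```sas
/* Principal components of the 1986 performance statistics;    */
/* OUT= saves the component scores (Prin1-Prin3) for later use */
proc princomp data = sashelp.baseball out = bb_pcs n = 3 plots = scree;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

The eigenvalue table and scree plot show how much variance each component retains, which helps in judging whether the reduction loses too much information.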
Cluster analysis
Introduction
Cluster analysis is a set of methods for finding groups of observations that go together in ways you are not aware of at the start. Examples:
  Do patrons of a store fall into groups of people who buy certain items?
  Do politicians fall into groups based on their votes on bills?
Cluster analysis
Methods
Agglomerative methods - start with items separate and gradually combine them using
  A measure of distance
  A measure of linkage
K-means methods - assign a number of clusters and distance measure and let algorithm do the work
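For the k-means approach, a sketch with PROC FASTCLUS might look like the following. Standardizing first and asking for three clusters are both assumptions made for this illustration, and the output data set names are hypothetical.

```sas
/* Standardize so that variables on large scales do not dominate the distances */
proc stdize data = sashelp.baseball out = bb_std method = std;
  var nAtBat -- nBB nAssts nOuts nError;
run;

/* K-means style clustering with a preset number of clusters */
proc fastclus data = bb_std maxclusters = 3 out = bb_kmeans;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

The output data set bb_kmeans holds a CLUSTER variable with each player's assignment.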
Cluster analysis
Example
Cluster analysis of the same variables
proc cluster data = sashelp.baseball method = average ccc pseudo print = 10
             outtree = bb4clust;
  var nAtBat -- nBB nassts nouts nerror;
run;
Cluster analysis
Example - continued
showed evidence of 3 clusters:
[Figure: Criteria for the Number of Clusters (The CLUSTER Procedure, Average Linkage Cluster Analysis). Panels: CCC, Pseudo F and Pseudo T-Squared, each plotted against Number of Clusters (2 to 10).]
Cluster analysis
Example - continued
with the following attributes
[Figure: one panel per variable: Errors in 1986, Put Outs in 1986, Assists in 1986, Walks in 1986, RBIs in 1986, Runs in 1986, Home Runs in 1986, Hits in 1986, Times at Bat in 1986.]
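One way to recover cluster membership from the tree saved by PROC CLUSTER is PROC TREE. This is a sketch assuming the bb4clust data set created in the example above; the output name bb_members is hypothetical.

```sas
/* Cut the tree at 3 clusters and save each observation's assignment */
proc tree data = bb4clust nclusters = 3 out = bb_members noprint;
run;

/* How many observations fall in each cluster? */
proc freq data = bb_members;
  tables cluster;
run;
```

Merging bb_members back onto the original data would then allow the clusters to be described variable by variable.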
Multidimensional scaling
Introduction
MDS is a method for figuring out how people are judging similarity, or what similarity is based on. There are many options and choices and (relatively) little literature.
Multidimensional scaling
Examples
How do people group politicians?
How do customers group brands of items?
Multidimensional scaling
What can go wrong
Overfitting - use training and test sets
Results may not be useful - try different methods
Exercises and further reading
Outline
18 Exercises and further reading
Exercises and further reading
Exercises
Come up with an example of a multivariate method that would be useful in your research or business
Exercises and further reading
Further reading
Using Multivariate Statistics by Barbara Tabachnick and Linda Fidell
Part VII
Summary and so on
General thoughts
Statistics and data analysis are not tools to be applied in a rote fashion.
Data analysis should illuminate a scientific or business phenomenon or attempt to solve a problem.
The time to consult with a data analyst is as early as possible and as often as possible
Summary
Descriptive statistics are a vital first step in any analysis
Graphical methods are also vital
Inference allows you to go from a sample to a population, but can have problems
Regression relates a DV to one or more IVs
Multivariate statistics allow you to summarize large data sets.
Contact information
Peter Flom
Peter Flom Consulting
www.StatisticalAnalysisConsulting.com
917 488 7176
Thank you!