
Lies, damn lies and ... SAS to the rescue!

Peter L. Flom

Peter Flom Consulting

SESUG, September 2015

Broad outline

1 Introduction
2 Descriptive statistics
3 Descriptive graphics
4 Inferential statistics
5 The regression family
6 Multivariate statistics

Part I

Introduction

Schedule

8:00 Descriptive statistics
8:40 Break
8:50 Descriptive graphics
9:30 Break
9:40 Inferential statistics
10:10 Break
10:20 The regression family
11:00 Break
11:10 Multivariate statistics
12:00 End

Introductions of participants and self

What I plan to do in this course

Give you a fundamental understanding of some basic statistical methods
Give you a very brief survey of many more advanced methods
Help you learn to work with statistics and statisticians
Give some SAS code you can give to others
Note that the most important stuff is at the beginning, so ask questions!

What I don’t plan to do in this course

Teach you to be a statistician
Teach you SAS

What I want from you

Attention
Questions - after all, it’s not even graded
Feedback - even anonymous is OK, after the course

Part II

Descriptive statistics

Introduction

Outline

1 Introduction

2 Measures of central tendency

3 Measures of spread

4 Other measures

Descriptive vs. inferential stats

Descriptive statistics describe a variable or a sample.
Inferential statistics let you infer from a sample to a population (more later).
Descriptive statistics are necessary even when your goal is inference.

Types of descriptive statistics

For continuous variables, descriptive statistics include
Measures of central tendency
Measures of dispersion or spread
Measures of skewness
Measures of kurtosis
Other measures

For categorical variables, mostly we are limited to frequencies.

Measures of central tendency

The mean

What it is

Definition

The mean is the ordinary average. Add up the numbers and divide by the number of numbers.

Or, if you want a formula

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where x is the variable and there are n values of the variable.

What can go wrong

Outliers
Skewness
The clock problem
The rate problem
Different scales

Mean salary

proc means data = sashelp.baseball maxdec = 2;
  var salary;
run;

Analysis Variable : Salary 1987 Salary in $ Thousands

N     Mean     Std Dev   Minimum   Maximum
263   535.93   451.12    67.50     2460.00

Alternatives

1 The median
2 The trimmed mean and Winsorized mean
3 The geometric mean
4 The harmonic mean

which are the topics of the next few slides


The median

Median salary

The median is simply the value that divides the distribution in half - half are lower, half are higher.

ods select BasicMeasures;
proc univariate data = sashelp.baseball;
  var salary;
run;

Basic Statistical Measures

Location               Variability
Mean     535.9259      Std Deviation         451.11868
Median   425.0000      Variance              203508
Mode     750.0000      Range                 2393
                       Interquartile Range   560.00000

What can go wrong

Sometimes we want the outliers
When there are many ties, the median may not be completely determined.


The trimmed mean and Winsorized mean

What it is

A compromise between the mean and the median.
To calculate the trimmed mean, you remove a certain percentage of the highest and lowest points and then find the mean of what remains.
The Winsorized mean is similar but, rather than deleting the points, you set them equal to the lowest or highest values that are not extreme.

What can go wrong

If the distribution is skewed, the trimmed mean is not an unbiased estimator for either the mean or median.

Trimmed and winsorized mean salary

ods select TrimmedMeans WinsorizedMeans;
proc univariate data = sashelp.baseball trimmed = .1 winsorized = .1;
  var salary;
run;

Percent per tail   N per tail   Trimmed mean   SE      Winsorized mean   SE
10.27              27           463.89         25.08   486.04            25.09


The geometric mean

What it is

Definition

It’s like the mean, except instead of adding the numbers and then dividing by the count, you multiply the numbers and take the nth root of the product

or, if you want a formula

$$\left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

What can go wrong

Doesn’t work when any value is 0 or negative

When to use it

Useful for combining measures on different scales. E.g. candidates for college - combine SAT (0 to 1600) and HS GPA (0 to 4)
Proportional growth over a series of times

Geometric mean of college applicants

data college;
  input name $ GPA SAT @@;
  datalines;
Jill 3.0 1550 Joe 4.0 1500
;
data college;
  set college;
  gmean = geomean(GPA, SAT);
  amean = mean(GPA, SAT);
run;
proc print data = college; run;

Obs   name   GPA   SAT    gmean     amean
1     Jill   3     1550   68.1909   776.5
2     Joe    4     1500   77.4597   752.0
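The other use mentioned above, proportional growth over a series of times, can be sketched the same way. This is a minimal illustration, not the author's code: the three yearly growth factors and the data set name growth are hypothetical.

```sas
/* HYPOTHETICAL yearly growth factors: +10%, +20%, -10% */
data growth;
  gmean = geomean(1.10, 1.20, 0.90);  /* average growth factor per year */
  amean = mean(1.10, 1.20, 0.90);     /* arithmetic mean, for comparison */
  compounded = gmean ** 3;            /* three years at gmean: 1.10 * 1.20 * 0.90 = 1.188 */
run;
proc print data = growth; run;
```

Compounding the geometric mean over the three years reproduces the total growth, while compounding the arithmetic mean would overstate it.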


The harmonic mean

Harmonic mean of round trip travel

Definition

It is the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

When to use it

Averaging rates, such as speeds or batting averages
Averaging ratios such as price-earnings ratios

What can go wrong

Like the geometric mean, it doesn’t work with negative numbers or 0’s.

SAS code

data speed;
  input To From @@;
  datalines;
50 80 40 70
;
data speed;
  set speed;
  hmean = harmean(to, from);
  amean = mean(to, from);
  time = 100 / to + 100 / from;
  actualspeed = 200 / time;
run;
proc print data = speed; run;

Obs   To   From   hmean   amean   time   actualspeed
1     50   80     61.54   65      3.25   61.54
2     40   70     50.91   55      3.93   50.91
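A short derivation shows why hmean and actualspeed agree in the output above. The symbols are introduced here for illustration only: d is the one-way distance (100 in the code), and v_1 and v_2 are the speeds out and back.

```latex
\text{actual speed}
  = \frac{\text{total distance}}{\text{total time}}
  = \frac{2d}{\dfrac{d}{v_1} + \dfrac{d}{v_2}}
  = \frac{2}{\dfrac{1}{v_1} + \dfrac{1}{v_2}}
  = H(v_1, v_2)

% First row of the output: v_1 = 50, v_2 = 80
\frac{2}{\frac{1}{50} + \frac{1}{80}} = \frac{2}{0.0325} \approx 61.54
```

The distance d cancels, so the result does not depend on how long the trip is, only on the two speeds.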


Exercises

1 Name 3 variables for which the mean would not be appropriate

2 For each of those, decide which measure of central tendency would be appropriate and why.

Measures of spread

Standard deviation

What it is

Definition

The standard deviation is the square root of the average squared difference between the mean and the individual values.

Or

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

What can go wrong

If the mean isn’t a good measure of central tendency, the SD isn’t a good measure of spread.

SD of salary

ods select BasicMeasures;
proc univariate data = sashelp.baseball;
  var salary;
run;

Basic Statistical Measures

Location               Variability
Mean     535.9259      Std Deviation         451.11868
Median   425.0000      Variance              203508
Mode     750.0000      Range                 2393
                       Interquartile Range   560.00000

Alternatives

Median absolute deviation (MAD)
Range and interquartile range
More quantiles
Gini’s mean difference
Variations on MAD
(also see graphics, later)

MAD

What it is

Definition

The median absolute deviation is what it says:
1 Find the median
2 Find each value’s deviation from the median
3 Take absolute values
4 Find the median of those
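The four steps above can be followed by hand in SAS. This is a minimal sketch, not the author's code; the data set names med and absdev and the variable names med_salary and absdev are made up for illustration. The final value should agree with the MAD that PROC UNIVARIATE reports in its RobustScale table.

```sas
/* Step 1: find the median of salary */
proc means data = sashelp.baseball noprint;
  var salary;
  output out = med median = med_salary;
run;

/* Steps 2 and 3: deviation of each value from the median, then absolute value */
data absdev;
  if _n_ = 1 then set med(keep = med_salary);   /* attach the median to every row */
  set sashelp.baseball(keep = salary);
  if not missing(salary) then absdev = abs(salary - med_salary);
run;

/* Step 4: the median of the absolute deviations is the MAD */
proc means data = absdev median maxdec = 2;
  var absdev;
run;
```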

What can go wrong?

Not very efficient
Not appropriate with asymmetric distributions

Range and interquartile range

What it is

The range is the distance from the smallest to the largest value
The IQR is the distance from the 1st quartile to the 3rd quartile

What can go wrong

The range is strongly affected by even a single outlier
The IQR is not affected at all by outliers

SAS code

ods select RobustScale;
proc univariate data = sashelp.baseball RobustScale;
  var salary;
run;

Robust Measures of Scale

Measure                  Value      Estimate of Sigma
Interquartile Range      560.0000   415.1285
Gini’s Mean Difference   468.0400   414.7897
MAD                      275.0000   407.7150
Sn                       381.6320   382.9424
Qn                       327.7303   325.9949

Exercises

List 3 variables that would not be well analyzed by the SD and suggest alternatives.

Other measures

Skewness

What it is

Definition

Skewness is the asymmetry of the distribution.

$$\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}}$$

Skewness can take any value: negative means left skew, positive means right skew, and 0 means symmetrical.

Alternatives and problems

One good way to look at skewness is with density plots (to be covered later)

What can go wrong

A single outlier can generate skewness.
Again, if the mean is not an appropriate measure of central tendency, this is not an appropriate measure of skew

Skewness of salary

ods select Moments;
proc univariate data = sashelp.baseball;
  var salary;
run;

Moments

N                 263      Sum Weights        263
Mean              535.93   Sum Observations   140948.507
Std Deviation     451.12   Variance           203508.064
Skewness          1.59     Kurtosis           3.05896473
Coeff Variation   84.18    Std Error Mean     27.82

Kurtosis

What it is

It is usually described as a measure of the peakedness of the distribution, although it mostly reflects how heavy the tails are. It is very nonintuitive and hard to interpret. It can be used to indicate a non-normal distribution, but its use beyond that is tricky (and confuses even experienced people).
Better to use graphical measures such as density plots (to be covered later)

Exercises and further reading

Exercises

List 3 variables that are markedly skewed, either to the right or left

Discussion

Further reading

www.statisticalanalysisconsulting.com/how-to-go-wrong-with-the-mean/

www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-harmonic-mean/


www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-trimmed-mean-and-median/

www.statisticalanalysisconsulting.com/statistical-measures-of-spread/

Exploratory Data Analysis by John Tukey











Part III

Descriptive graphics

Introduction

Outline

5 Introduction

6 Univariate graphics

7 Bivariate graphics

8 Trivariate and multivariate graphics

9 Time series data

10 Exercises and further reading

General thoughts on statistical graphics

A good graph will
Show the data
Induce the viewer to think about the substance of the data
Avoid distorting the data
Present many numbers in a small space
Make large data sets coherent
Encourage the eye to look at different parts of the data
Reveal several levels of detail
Serve a clear purpose

But a good graphic will not
Be a substitute for a table
Be a substitute for a model

Use of color, shape and so on
Consider the audience
Not all chart junk is bad

Univariate graphics

Univariate discrete data

Introduction

This is usually counts or proportions of something, e.g. number of Democrats, Republicans and others. Here:

Pie charts should be avoided
Dot charts are often good
A table may be even better
Log scales are sometimes helpful

Pie chart with 51 categories - a mess

[Figure: pie chart with one slice per Geographical_Area (the states, the District of Columbia and Puerto Rico)]

Dot chart with 51 categories

[Figure: dot chart of population (millions, log scale) by state, ordered from California (largest) to Wyoming (smallest), with points colored by region (Midwest, Northeast, South, West)]

Pie chart with 9 categories - a table or dot chart

Division                        Population
East North Central Division    46395654
East South Central Division    18084651
Middle Atlantic Division       40621237
Mountain Division              21784507
New England Division           14303542
Pacific Division               49070441
South Atlantic Division        58398377
West North Central Division    20165794
West South Central Division    35235521

Pie chart with 4 categories - a table or text

[Figure: pie chart of Geographical_Area with four slices: Midwest Region, Northeast Region, South Region, West Region]

SAS code

The SAS code for the pie charts isn’t shown because you shouldn’t use it. The code for the dot plot is complex; I can e-mail it to you if you want.
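For readers who want a starting point, a much simpler dot plot can be drawn with the DOT statement of PROC SGPLOT. This is a minimal sketch and not the author's code for the 51-category figure: it uses the nine division populations from the earlier slide, and the data set name divisions and its variable names are made up for illustration.

```sas
/* Division populations copied from the 9-category slide */
data divisions;
  infile datalines dsd;            /* comma-delimited, so names may contain blanks */
  length division $ 20;
  input division $ population;
  datalines;
East North Central,46395654
East South Central,18084651
Middle Atlantic,40621237
Mountain,21784507
New England,14303542
Pacific,49070441
South Atlantic,58398377
West North Central,20165794
West South Central,35235521
;

proc sgplot data = divisions;
  /* one dot per division, sorted from largest to smallest population */
  dot division / response = population categoryorder = respdesc;
  xaxis label = "Population";
run;
```

Sorting by the response rather than alphabetically is the design choice that makes the dot plot easy to read.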

Univariate continuous data

Introduction

Histograms can be misleading, at least if they are unadorned
Density plots are often better, and several smooths can be used.
Box plots provide a useful summary
When N is small, strip charts can be useful

Density plot - example

[Figure: "Density plot, salaries": kernel density estimates of 1987 Salary in $ Thousands for three bandwidths (Kernel, Kernel c=0.5, Kernel c=2)]

Density plot - SAS code

proc sgplot data = sashelp.baseball;
  density salary / type = kernel;
  density salary / type = kernel (c = .5)
    curvelabelattrs = (color = red);
  density salary / type = kernel (c = 2)
    curvelabelattrs = (color = green);
  xaxis min = 0;
run;

Box plot - example

[Figure: vertical box plot of 1987 Salary in $ Thousands]

proc sgplot data = sashelp.baseball;
  vbox salary;
run;

Box plot - example, log scale

[Figure: vertical box plot of 1987 Salary in $ Thousands on a log-scaled axis]

proc sgplot data = sashelp.baseball;
  vbox salary;
  yaxis type = log logbase = 10 logstyle = linear;
run;

Strip plot - example

[Figure: "Salary strip plot": 1987 Salary in $ Thousands plotted against a random horizontal jitter]

Strip plot - SAS code

data strip;
  set sashelp.baseball;
  jitter = 1 * (ranuni(1234) / 5) + .9;
run;
title "Salary strip plot";
proc sgplot data = strip;
  scatter x = jitter y = salary;
  xaxis min = 0 max = 2 display = none;
run;

Bivariate graphics

Both categorical

Mosaic plots

A little known and under-used plot is the mosaic plot. It is a way of visualizing a crosstabulation. For example, sex and party ID.

[Figure: mosaic plot of party by sex from PROC FREQ]

ods select MosaicPlot;
proc freq data = mosaic;
  table party*sex / plots = mosaic;
  weight count;
run;
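The slides do not show how the mosaic data set is built. A minimal sketch of one way to set it up follows, with one row per cell of the crosstabulation and a count variable for the WEIGHT statement. The counts are HYPOTHETICAL and chosen only so the code runs; they are not the author's data.

```sas
/* HYPOTHETICAL cell counts, for illustration only */
data mosaic;
  length party $ 10 sex $ 6;       /* avoid truncating values at the default 8 characters */
  input party $ sex $ count;
  datalines;
Democrat   Female 120
Democrat   Male    90
Republican Female  80
Republican Male   110
Other      Female  40
Other      Male    45
;

ods graphics on;                   /* the mosaic plot is an ODS Graphics plot */
ods select MosaicPlot;
proc freq data = mosaic;
  table party*sex / plots = mosaic;
  weight count;                    /* each row stands for count respondents */
run;
```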

One categorical

Introduction

When N is relatively small, a strip chart is good - it shows all the data. When N is larger, a parallel boxplot shows a lot of the key information.

Parallel boxplot - example

[Figure: "Salary by division": box plots of 1987 Salary in $ Thousands for each League and Division (AE, AW, NE, NW)]

title "Salary by division";
proc sgplot data = sashelp.baseball;
  vbox salary / category = div;
run;

Strip chart - example

[Figure: "Salary by division, rookies": 1987 Salary in $ Thousands (horizontal axis) by League and Division (AE, NE, AW, NW)]

title "Salary by division, rookies";
proc sgplot data = sashelp.baseball;
  scatter x = salary y = div;
  where yrmajor le 1;
run;

Neither categorical

The scatter plot

The most common (and one of the best) basic options here is the scatter plot. But there are variations.

Scatter plot - basic example

[Figure: scatter plot of 1987 Salary in $ Thousands against Career Hits]

proc sgplot data = sashelp.baseball;
  scatter x = CrHits y = Salary;
run;

Scatter plot - log scale

[Figure: scatter plot of 1987 Salary in $ Thousands against Career Hits, both axes on log scales]

proc sgplot data = sashelp.baseball;
  scatter x = CrHits y = Salary;
  xaxis type = log logstyle = linear;
  yaxis type = log logstyle = linear;
run;

Scatter plot - log scale plus loess

[Figure: scatter plot of 1987 Salary in $ Thousands against Career Hits on log scales, with a loess curve]

Scatter, log scale with loess

proc sgplot data = sashelp.baseball;
  xaxis label = "Career hits (log scale)"
    type = log logstyle = linear;
  yaxis label = "Salary in thousands of $ (log scale)"
    type = log logstyle = linear;
  scatter x = CrHits y = salary;
  loess x = CrHits y = Salary / nomarkers;
  ellipse x = CrHits y = Salary;
run;

Scatter plot - A fancy example

[Figure: "Scatter plot with density plots": Infant Mortality against Unemployment (%) with points labeled by state abbreviation, a prediction ellipse (α=.05), and marginal density plots on both axes]

Scatter plot - Another fancy example

[Figure: "Box plot w/barchart": box plots of Weight for each value of ht (55 to 79), with a bar chart of N]

Trivariate and multivariate graphics

All continuous

The scatterplot matrix - example

[Figure: scatterplot matrix of 1987 Salary in $ Thousands, Career Home Runs and Career Hits]

proc sgscatter data = sashelp.baseball;
  matrix CrHits CrHome Salary;
run;

The scatterplot matrix - a more complex example

proc sgscatter data = sashelp.baseball;
  matrix CrHits CrHome CrBB CrRbi /
    markerattrs = (symbol = circlefilled size = 8)
    diagonal = (kernel)
    ellipse
    colorresponse = salary;
run;


Bubble plot

[Figure: "Bubble plot of rookie salaries": Career Home Runs against Career Hits, with bubble size showing salary]

title "Bubble plot of rookie salaries";
proc sgplot data = sashelp.baseball;
  bubble x = CrHits y = CrHome size = salary;
  where yrmajor le 1;
run;

Some continuous

Coplot

[Figure: scatter plots of Career Home Runs against Career Hits, one panel per League and Division (NW, NE, AW, AE)]

proc sgpanel data = sashelp.baseball;
  panelby div;
  scatter x = CrHits y = CrHome;
run;


Scatter plot matrix with group variable - example

[Figure: "Several statistics by league and division - 5 years or less": scatterplot matrix of Career Times at Bat, 1987 Salary in $ Thousands, Career Home Runs and Career Hits, grouped by League and Division (NW, NE, AW, AE)]

proc sgscatter data = sashelp.baseball;
  title "Several statistics by league and division - 5 years or less";
  matrix CrHits CrHome Salary CrAtBat / group = div diagonal = (kernel);
  where yrmajor le 5;
run;

Scatter plot matrix - another example

proc sgscatter data = sashelp.baseball;
  plot (salary) * (nHits nHome nBB nAssts)
    / markerattrs = (symbol = circlefilled size = 8)
      loess
      colorresponse = yrmajor
      colormodel = twocolorramp;
run;

None continuous

Introduction

When all variables are categorical, generalizations of the mosaic plot can be used.

Time series data

Electrical workers over time

[Figure: "Series Values for ELECTRIC" from PROC TIMESERIES: electrical workers, thousands, by DATE, 1977 to 1982]

[Figure: "Seasonal Decomposition/Adjustment for ELECTRIC": panels for the Seasonally Adjusted, Irregular, Seasonal-Irregular and Trend-Cycle components, 1977 to 1982]

Electrical workers over time - SAS code

title "Timeseries decomposition";
proc timeseries data = sashelp.workers out = _null_
    plots = (series decomp);
  id date interval = month;
  var electric;
run;


Exercises and further reading

Exercises

Describe a set of variables and say what graph you would use for it and why


Discussion


Further reading - blog links

Parallel box plots http://www.statisticalanalysisconsulting.com/graphics-for-bivariate-data-parallel-box-plots/

Pie is delicious but not nutritious http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-nutritious/

Scatterplots http://www.statisticalanalysisconsulting.com/scatterplots-and-enhancements/

Graphics: The good, the bad and the ugly http://www.statisticalanalysisconsulting.com/graphics-the-good-the-bad-and-the-ugly/

Further reading - books

Creating more effective graphs by Naomi Robbins
Visualizing data by William S. Cleveland
The elements of graphing data by William S. Cleveland
A trout in the milk by Howard Wainer

Part IV

Inferential statistics

From sample to population

A population is the entire set of all the subjects (people or whatever) that you want to study.
A sample is a subset of that population.
A random sample is a sample where all subjects have a definable chance of being selected

Null and alternative hypotheses

The null hypothesis is usually "nothing is going on"
The alternative is "something is going on"
Trial analogy

What is a p value?

Definition

If, in the population from which this sample was randomly drawn, the null was strictly true, what is the probability of getting a test statistic at least as large as the one we got in a sample the size of the one we have?

In other words, if we do 1000 really silly things, what proportionwill come out significant?
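The "1000 really silly things" idea can be tried out by simulation. The following is a minimal sketch under assumptions that are not in the slides: two groups of 30 drawn from the same normal distribution, a pooled two-sample t test, a 0.05 cutoff, and made-up data set and variable names (sim, tt, flagged, significant). Because the null is exactly true in every simulated study, the proportion flagged as significant should come out close to 0.05.

```sas
/* 1000 simulated studies in which the null is exactly true */
data sim;
  call streaminit(2015);                 /* arbitrary seed, for reproducibility */
  do study = 1 to 1000;
    do group = 1 to 2;
      do i = 1 to 30;
        y = rand("normal", 0, 1);        /* same distribution in both groups */
        output;
      end;
    end;
  end;
run;

/* Run one t test per study, keeping only the table of p values */
ods graphics off;
ods exclude all;                         /* suppress 1000 sets of printed output */
proc ttest data = sim;
  by study;
  class group;
  var y;
  ods output TTests = tt;
run;
ods exclude none;
ods graphics on;

/* Flag the studies that reached p < 0.05 */
data flagged;
  set tt;
  where method = "Pooled";
  significant = (probt < 0.05);
run;

/* The mean of a 0/1 flag is the proportion of significant results */
proc means data = flagged mean maxdec = 3;
  var significant;
run;
```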

Experiments vs. observational studies

In an experiment, subjects are randomly selected and then randomly assigned to a condition
In an observational study, neither of these is true
Some people use the term quasi-experiment where one of the above is true

Problems

Not usually the question we want to ask
Strongly affected by sample size

The Bayesian approach

Idea
Set a prior - often a uniform prior
Let data modify it.

Advantages
More intuitive
Lets you have a prior

Disadvantages
Hard to set a prior
Uninformed prior usually gives similar results to frequentist approach
Still not the question we are interested in

What we want

Effect sizes and measures of their accuracy
Risk reward analysis

Further reading

The Insignificance of Statistical Significance Testing by Douglas Johnson
The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey

Part V

The regression family

Introduction

Outline

11 Introduction

12 The OLS model

13 Other models for continuous DV

14 The logistic family

15 Count models

16 Multilevel models


Introduction

What is regression?

Regression is a term for a variety of models relating dependent variables (usually just one) to one or more independent variables.

Varieties of regression

The type of regression depends on the nature of the dependent variable and on the nature of the relationships.

Continuous - OLS and alternatives (see below)
Dichotomous - Logistic
Categorical (>2 levels) - Multinomial logistic
Ordinal - ordinal logistic
Count - Poisson, negative binomial and variations
Time to event - survival models

The OLS model

What it is

Ordinary least squares is the most common regression model and it is what people mean when they say ‘regression’.
The model is

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + e$$

where e is error and is normally distributed with 0 mean and constant variance.

What can go wrong

Overfitting
Nonlinear fits
Nonnormal residuals
Dependent data
Collinearity
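Several of these problems can be checked from a basic PROC REG run. This is a minimal sketch, not the author's analysis: it assumes the same baseball predictors used in the later examples, asks for variance inflation factors as a collinearity check, and requests the diagnostics panel for a look at the residuals.

```sas
proc reg data = sashelp.baseball plots = diagnostics;
  /* VIF flags predictors that are nearly linear combinations of the others */
  model salary = YrMajor nAtBat nHits nHome nOuts / vif;
run;
quit;
```

Career and season batting counts tend to move together, so large VIF values would not be surprising here.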

Other models for continuous DV

Introduction

Multivariate adaptive regression splines (MARS) - PROC ADAPTIVEREG
Quantile regression - PROC QUANTREG
Transformations - PROC TRANSREG

More information: See my paper at SGF 2015.

MARS

Introduction

MARS models allow extremely flexible curves (called splines) to be fit to data.
MARS models are most useful
In high dimensional spaces
When there is little substantive reason to assume linearity or a low-level polynomial fit

Advantages and disadvantages of MARS models

Advantages of MARS models:
Very flexible fitting of the relationship between independent and dependent variables
Model selection methods that can sharply reduce the dimension of the model.
SAS implementation of these models extends them to dependent variables in the exponential family.
Can be more accurate than GLM, with greater parsimony

Disadvantages of MARS models:
Hard to interpret
Less familiar


Example

I modeled baseball salary as a function of various attributes of the players. ADAPTIVEREG got a significantly higher R² with considerably fewer terms. But the result is very hard to interpret.

proc adaptivereg data = sashelp.baseball plots = all details = bases;
  class team;
  model salary = YrMajor nAtBat nHits nHome nOuts;
run;


Quantile regression

Introduction

There are at least three motivations for quantile regression:
DV is bimodal or multimodal
Highly skewed DV
Substantive interest in the quantiles

Advantages include:
No assumptions about the distribution of the residuals
More flexible hypotheses

Disadvantages include:
Not as powerful as OLS regression when that is the appropriate model
Not robust to high leverage points.

Example

A quantile regression of baseball salary:

proc quantreg data = sashelp.baseball plots = all;
  model salary = YrMajor nAtBat nHits nHome nOuts /
    quantile = (0.1, 0.5, 0.9);
run;

revealed that the relationship between salary and various player attributes was different at different levels of salary, e.g.:

Number of home runs was more important at high levels of salary.

but this should be viewed with caution because of high leverage points.


TRANSREG

Introduction

Sometimes it makes sense to transform one or more variables.
Can do in data step but
PROC TRANSREG offers many options and allows automation of some tasks
Some transformations (e.g. splines) are hard or impossible in data step
TRANSREG is very flexible and allows optimal fitting.

Example

A spline regression of baseball salary

proc transreg data = sashelp.baseball plots = all;
  model identity(salary) = spline(YrMajor nAtBat nHits nHome nOuts);
run;

showed non-monotonic relationships between salary and performance


The logistic family

Introduction

When the dependent variable is categorical (either dichotomous, nominal or ordinal) OLS regression is not recommended because

The assumption of normal residuals is violated
The predicted values can be ludicrous

The usual method for these cases is logistic regression (either ‘normal’, multinomial or ordinal). The key output is odds ratio estimates.

What are odds ratios?

In OLS regression the dependent variable is continuous. In logistic, it’s not. How do we go from a 0 - 1 response to a continuous one from −∞ to ∞?

Find odds of something happening for each level of each IV, e.g. odds of men and women voting for Obama. That goes from 0 to ∞
Take ratio of the odds. That goes from 0 to ∞ as well.
Take log of the ratio for modeling. That goes from −∞ to ∞
But the OR is easier to interpret
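The three steps above can be worked through with numbers. The proportions below are HYPOTHETICAL and chosen only to keep the arithmetic easy: suppose 60% of women and 50% of men voted for the candidate.

```latex
% Step 1: odds for each group (range 0 to infinity)
\text{odds}_{\text{women}} = \frac{0.60}{1 - 0.60} = 1.5
\qquad
\text{odds}_{\text{men}} = \frac{0.50}{1 - 0.50} = 1.0

% Step 2: ratio of the odds (range 0 to infinity)
\text{OR} = \frac{\text{odds}_{\text{women}}}{\text{odds}_{\text{men}}} = \frac{1.5}{1.0} = 1.5

% Step 3: log of the ratio, used for modeling (range -infinity to infinity)
\ln(\text{OR}) = \ln(1.5) \approx 0.405
```

An odds ratio of 1 (log odds ratio of 0) would mean the two groups have the same odds, which is why the OR is the easier number to interpret.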

Logistic regression - examples

Predict/explain purchase of a product vs. no purchase - dichotomous
Predict/explain position on a team - multinomial
Predict/explain likelihood of returning - ordinal

What can go wrong

Coding 0 and 1 incorrectly - be careful which response SAS is modelling
Effect coding. For categorical IVs, SAS defaults to effect coding, but reference coding is often better
Quasi-complete and complete separation - slicing the pie too thin
Concordant and discordant in output don’t mean what they seem to
Need to use SLICE to get interaction odds ratios
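The first two pitfalls can be avoided by stating the event level and the coding explicitly. This is a minimal sketch, not the author's analysis: the outcome high_salary is a made-up indicator (salary above 425, the median reported earlier) created only to have a dichotomous DV, and the data set name bb is hypothetical.

```sas
/* Made-up dichotomous outcome: 1 if salary is above 425, 0 otherwise */
data bb;
  set sashelp.baseball;
  if not missing(salary) then high_salary = (salary > 425);
run;

proc logistic data = bb;
  /* param = ref asks for reference coding instead of the default effect coding */
  class league (ref = "National") / param = ref;
  /* event = "1" states which response level is being modelled */
  model high_salary (event = "1") = league YrMajor nHits;
run;
```

With reference coding, the odds ratio for league compares American with National directly.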

Ordinal and multinomial logistic example

When the DV has multiple categories, they can be ordinal or nominal. If ordinal, use PROC LOGISTIC with LINK = clogit. If nominal, use LINK = glogit. Interpretation can be tricky, but is basically a generalization of the dichotomous case.

Count models

Introduction

When the DV is a count (a non-negative integer) and especially when the counts aren’t very large, OLS is not recommended. Count models such as Poisson or negative binomial regression should be used. PROC GENMOD is used for these analyses.

Examples

How many cell phones does a person own?
How many divorces will a person go through?

What can go wrong?

Overdispersion
Failure to fit
Abundance of 0’s - use ZIP or ZINB models
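A common way to look for overdispersion is to fit the Poisson model and then the negative binomial model to the same data. This is a minimal sketch, not the author's analysis: it assumes the number of home runs (nHome) as the count DV and two arbitrary predictors from the baseball data.

```sas
/* Poisson model: the variance is assumed equal to the mean */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = poisson link = log;
run;

/* Negative binomial: adds a dispersion parameter for overdispersed counts */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = negbin link = log;
run;
```

In the Poisson fit, a deviance divided by its degrees of freedom that is well above 1 is one sign of overdispersion.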

Multilevel models

Introduction

All the regression models above assume independent errors. When this is violated, things can go very wrong. Multilevel models (MLM) are one way to deal with this.

Examples

Repeated measurements of the same thing on the same people
Measurements on people who are clustered
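The clustered case can be sketched with PROC MIXED. This is a minimal illustration under an assumption not made in the slides: that players on the same team form the clusters, so each team gets its own random intercept. The choice of predictors is arbitrary.

```sas
proc mixed data = sashelp.baseball;
  class team;
  model salary = YrMajor nHits / solution;   /* fixed effects, with estimates printed */
  random intercept / subject = team;         /* players on the same team share an intercept */
run;
```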




Exercises

From your experience, list several regression problems and propose a regression method for each.


Discussion


Further reading - blog links

Simple linear regression: http://www.statisticalanalysisconsulting.com/what-is-simple-linear-regression/

Multiple linear regression: http://www.statisticalanalysisconsulting.com/what-is-multiple-linear-regression/

Survival analysis: http://www.statisticalanalysisconsulting.com/what-is-survival-analysis/

Alternative methods of regression when OLS is not right: http://support.sas.com/resources/papers/proceedings15/3412-2015.pdf



Further reading - books

Regression Analysis by Example by Samprit Chatterjee and Ali Hadi
Regression Models for Categorical and Limited Dependent Variables by J. Scott Long
Categorical Data Analysis by Alan Agresti

Part VI

Multivariate statistics

Introduction

Sometimes there is no dependent variable, but you want to be able to figure out what is going on in a huge mass of data.

Exploratory factor analysis

Introduction

Factor analysis is a method of finding latent factors in multivariate data. Latent variables are those that can't be directly measured. Examples:

Personality scales
IQ
Views on complex issues


Steps involved

Extracting factors - several methods
Rotation - many methods, in two groups
  Orthogonal - each factor is uncorrelated with the others; easier to interpret but may not be realistic
  Oblique - factors can be correlated
Interpretation - EFA is not determinate; much will depend on interpretation


Example

Factor analysis of current statistics showed 2 factors:

proc factor data = sashelp.baseball r = varimax;
  var nAtBat -- nBB nAssts nOuts nError;
run;

Rotated Factor Pattern

Variable  Label                  Factor1    Factor2
nAtBat    Times at Bat in 1986   0.88078    0.37098
nHits     Hits in 1986           0.87357    0.33843
nHome     Home Runs in 1986      0.81700   -0.19594
nRuns     Runs in 1986           0.91078    0.21618
nRBI      RBIs in 1986           0.92417    0.04853
nBB       Walks in 1986          0.74709    0.09339
nAssts    Assists in 1986        0.03736    0.92947
nOuts     Put Outs in 1986       0.45303   -0.03541
nError    Errors in 1986         0.10152    0.87866


What can go wrong

GIGO can appear like GIPO - garbage in, pearls out
No simple structure
Unclear number of factors
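When the number of factors is unclear, a scree plot and a second rotation are common checks. This sketch is an assumption about how one might follow up, not the author's analysis: it requests a scree plot and swaps in an oblique (promax) rotation for comparison with varimax.

```sas
/* Follow-up sketch: scree plot plus an oblique rotation for comparison */
proc factor data = sashelp.baseball
            method = principal scree
            nfactors = 2 rotate = promax;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

If the rotated pattern changes a lot between rotations, or no variable loads cleanly on one factor, that is a hint that simple structure is missing.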

Principal component analysis (PCA)

Introduction

PCA is a dimension reduction method; use it when you have a large number of variables that you want to reduce with minimal loss of information.
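A minimal PROC PRINCOMP sketch follows. The variable list mirrors the baseball variables used elsewhere in these slides; keeping two components and the output data set name work.bb_pcs are assumptions made for illustration.

```sas
/* PCA sketch: keep the first two components and save the scores */
proc princomp data = sashelp.baseball n = 2 out = work.bb_pcs;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

The output data set contains the original variables plus the component scores Prin1 and Prin2, which can be plotted or used in later analyses.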


What can go wrong

Components may not make sense
Components may not be useful for further analysis
If doing regression, consider partial least squares
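Partial least squares is available in PROC PLS. The sketch below assumes log salary in sashelp.baseball as the DV and uses leave-one-out cross-validation to help choose the number of factors; both choices are illustrative assumptions.

```sas
/* PLS sketch with leave-one-out cross-validation (assumed example) */
proc pls data = sashelp.baseball method = pls cv = one;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    nAssts nOuts nError;
run;
```

Unlike PCA followed by regression, PLS chooses its components with the DV in mind.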

Cluster analysis

Introduction

Cluster analysis is a set of methods for finding groups of observations that go together in ways you are not aware of at the start. Examples:

Do patrons of a store fall into groups of people who buy certain items?
Do politicians fall into groups based on their votes on bills?


Methods

Agglomerative methods - start with items separate and gradually combine them using
  A measure of distance
  A measure of linkage
K-means methods - choose a number of clusters and a distance measure and let the algorithm do the work
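A k-means analysis can be sketched with PROC FASTCLUS. Standardizing the variables first and asking for three clusters are assumptions made for this illustration, as are the output data set names.

```sas
/* Standardize first so no variable dominates the distance (an assumption) */
proc stdize data = sashelp.baseball out = work.bb_std method = std;
  var nAtBat -- nBB nAssts nOuts nError;
run;

/* K-means with a chosen number of clusters */
proc fastclus data = work.bb_std maxclusters = 3 out = work.bb_km;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

The output data set adds a CLUSTER variable giving each observation's assigned cluster.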


Example

Cluster analysis of the same variables

proc cluster data = sashelp.baseball
             method = average ccc pseudo print = 10
             outtree = bb4clust;
  var nAtBat -- nBB nassts nouts nerror;
run;


Example - continued

showed evidence of 3 clusters:

[Figure: The CLUSTER Procedure, Average Linkage Cluster Analysis. Criteria for the Number of Clusters: CCC, Pseudo F, and Pseudo T-Squared, each plotted against Number of Clusters (2 to 10).]


Example - continued

with the following attributes:

[Figure: one panel for each variable: Errors, Put Outs, Assists, Walks, RBIs, Runs, Home Runs, Hits, and Times at Bat in 1986.]

Multidimensional scaling

Introduction

MDS is a method for figuring out how people are judging similarity, or what similarity is based on. There are many options and choices and (relatively) little literature.


Examples

How do people group politicians?
How do customers group brands of items?
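For the brands example, PROC MDS takes a matrix of dissimilarities as input. The sketch below uses a tiny hypothetical dissimilarity matrix among five made-up brands (A to E); the values, names, and option choices are all assumptions for illustration.

```sas
/* Hypothetical dissimilarities among five brands (lower triangle only) */
data work.brand_dist;
  input brand $ A B C D E;
  datalines;
A 0 . . . .
B 2 0 . . .
C 5 4 0 . .
D 6 5 2 0 .
E 7 6 3 2 0
;
run;

/* Ordinal (nonmetric) MDS in two dimensions */
proc mds data = work.brand_dist level = ordinal dimension = 2
         out = work.brand_coords;
  id brand;
run;
```

Plotting the two dimensions in the output data set shows which brands are judged to be close together.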


What can go wrong

Overfitting - use training and test sets
Results may not be useful - try different methods





Exercises

Come up with an example of a multivariate method that would be useful in your research or business.


Further reading

Using Multivariate Statistics by Barbara Tabachnick and Linda Fidell

Part VII

Summary and so on

General thoughts

Statistics and data analysis are not tools to be applied in a rote fashion.
Data analysis should illuminate a scientific or business phenomenon or attempt to solve a problem.
The time to consult with a data analyst is as early as possible and as often as possible.

Summary

Descriptive statistics are a vital first step in any analysis
Graphical methods are also vital
Inference allows you to go from a sample to a population, but can have problems
Regression relates a DV to one or more IVs
Multivariate statistics allow you to summarize large data sets

Contact information

Peter Flom
Peter Flom Consulting
www.StatisticalAnalysisConsulting.com
917 488 7176


Thank you!