descriptive - stat.duke.edukfl5/Lock_RREE_Descriptive_2009.pdf · Descriptive Statistics • Before...

Descriptive Statistics. Kari Lock, Harvard University, Department of Statistics. Rigorous Research in Engineering Education, 12/3/09


Page 1: Descriptive Statistics

Kari Lock

Harvard University, Department of Statistics

Rigorous Research in Engineering Education

12/3/09

Page 2: Descriptive Statistics

• Before making inferences about the population, we need to describe the data observed in our sample

• visual displays: graphs, plots

• summary statistics: numerical summaries of the data

• The appropriate descriptive techniques depend on the type of data collected

Page 3: Visualization: One Categorical Variable

[Figures: Pie Chart; Bar Chart]

• Goal: display the number or percent of people in each category

Page 4: Visualization: One Quantitative Variable

[Figures: Boxplot of Exam Score, marking the 25th percentile, the 50th percentile (median), the 75th percentile, and an outlier; Histogram of Exam Score (x-axis: Exam Score, y-axis: Frequency), with an outlier marked]

• Goal: Visually display the entire distribution of observed numbers. Location, spread, shape and outliers are all visible.
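The percentile markers on the boxplot can also be computed directly. The sketch below is only an illustration: the scores are made up, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here, not something the slides specify.

```python
import statistics

def boxplot_summary(values):
    """Return the quartiles and the values flagged by the 1.5 * IQR rule."""
    # Quantile conventions differ between tools; "inclusive" is one common choice.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the width of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical exam scores (not the data behind the slides).
exam_scores = [52, 71, 74, 76, 78, 80, 82, 84, 86, 88, 90]
summary = boxplot_summary(exam_scores)
print(summary)
```

Here the box spans 75 to 85 with the median at 80, and the score of 52 falls below the lower fence of 60, so it is flagged as an outlier.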

Page 5: Visualization: One Quantitative Variable

• Symmetric:

• Skewed:

• Outlier – values that fall outside an overall pattern.

Page 6: Distribution

• If a histogram is theoretically known, it is smoothed over with a distribution. This distribution represents what the hypothetical histogram would look like if we were to take many samples.

[Figures: two histograms, x-axis from -3 to 3, y-axis: Frequency]
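One way to see the link between a histogram and a distribution is to simulate it. The sketch below assumes the theoretical distribution is a standard normal (the slides show an axis from -3 to 3 but do not name the distribution) and compares the proportion of simulated values in each bin with the probability the distribution assigns to that bin.

```python
import random
from statistics import NormalDist

def binned_proportions(values, edges):
    """Proportion of values falling in each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

rng = random.Random(2009)  # fixed seed so the sketch is repeatable
sample = [rng.gauss(0, 1) for _ in range(100_000)]  # assumed: standard normal

edges = [-3, -2, -1, 0, 1, 2, 3]
observed = binned_proportions(sample, edges)

# Theoretical probability of each bin under the assumed distribution.
standard_normal = NormalDist(mu=0, sigma=1)
expected = [standard_normal.cdf(hi) - standard_normal.cdf(lo)
            for lo, hi in zip(edges, edges[1:])]

for lo, hi, obs, exp in zip(edges, edges[1:], observed, expected):
    print(f"[{lo:>2}, {hi:>2}): observed {obs:.3f}, theoretical {exp:.3f}")
```

With a large simulated sample, the observed bin proportions should sit close to the theoretical ones, which is the sense in which the smooth curve stands in for the histogram.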

Page 7: Visualization: One Categorical, One Quantitative

[Figures: Side-by-Side Boxplots of Exam Scores by Gender (Males, Females); Side-by-Side Boxplots of Exam Scores by Experimental Group (Control, Treatment)]

• Compare the numerical distributions of two (or more) different groups
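A numerical companion to side-by-side boxplots is to summarize the quantitative variable separately within each category. The sketch below uses hypothetical (group, score) records; the group names echo the slide, but the scores are invented for illustration.

```python
import statistics
from collections import defaultdict

def summarize_by_group(records):
    """Group (label, value) pairs by label and summarize each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {
        label: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for label, values in groups.items()
    }

# Hypothetical exam scores for two experimental groups.
records = [
    ("Control", 62), ("Treatment", 75), ("Control", 70), ("Treatment", 81),
    ("Control", 74), ("Treatment", 88), ("Control", 79), ("Treatment", 93),
]
by_group = summarize_by_group(records)
for label, stats in by_group.items():
    print(label, stats)
```

Comparing the per-group summaries shows the same shift a pair of boxplots would show, although it does not say whether the difference is statistically significant.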

Page 8: Visualization: One Categorical, Two Quantitative

[Figure: Side-by-Side Boxplots of Exam Scores by Experimental Group (Control, Treatment)]

• Often it’s not obvious whether two groups are significantly different. That’s what statistics is for!

Page 9: Visualization: Two Quantitative

[Figure: Scatterplot of Post-Test vs. Pre-Test scores, with an outlier marked]

• Goal: Visualize the relationship between two quantitative variables. Note: Both variables have to be measured on the same set of individuals.

Page 10: Summary Statistics: Categorical Variable

• Proportion: Report the proportion of observations that fall in each category.
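As a minimal sketch of this summary, the proportion for each category is its count divided by the total number of observations. The category labels below are hypothetical.

```python
from collections import Counter

def category_proportions(observations):
    """Return the share of observations that fall in each category."""
    counts = Counter(observations)
    total = len(observations)
    return {category: count / total for category, count in counts.items()}

# Hypothetical survey responses (one categorical variable).
majors = ["Mechanical", "Civil", "Mechanical", "Electrical",
          "Civil", "Mechanical", "Electrical", "Mechanical"]
proportions = category_proportions(majors)
print(proportions)
```

The proportions always sum to 1, which makes them the natural numbers to put behind a pie chart or a bar chart.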

Page 11: Summary Statistics: One Quantitative Variable

Mean – the “average”: add everything up and divide by the sample size.

Median – the “middle” point at which 50% of the data are below and 50% are above.

Standard Deviation – A measure of the average distance from the mean. Indicates how “spread out” the data are.

Variance = (Standard Deviation)²
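These summaries can be sketched with Python's standard `statistics` module. The scores are hypothetical, and the sample versions of the standard deviation and variance (dividing by n − 1) are assumed here, since the slides do not say which version is meant.

```python
import statistics

# Hypothetical exam scores.
scores = [70, 75, 80, 85, 90]

mean = statistics.mean(scores)          # add everything up, divide by n
median = statistics.median(scores)      # middle value of the sorted data
sd = statistics.stdev(scores)           # sample standard deviation (n - 1)
variance = statistics.variance(scores)  # sample variance (n - 1)

print(f"mean={mean}, median={median}, sd={sd:.2f}, variance={variance}")
```

For these scores the mean and the median are both 80, and squaring the standard deviation recovers the variance of 62.5.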

Page 12: Summary Statistics: One Quantitative Variable

Lower standard deviation (0.33):

Higher standard deviation (1.32):

Same mean

Page 13: Summary Statistics: Two Quantitative Variables

• Correlation (r): a measure of linear association between two numeric variables.

• Correlation runs between -1 and 1. Values closer to -1 and 1 represent stronger relationships, values closer to 0 are weaker.

– Positive Correlation: as one variable goes up the other goes up

– Negative Correlation: as one variable goes up the other goes down

– Zero Correlation means no linear association
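The slides do not give a formula for r; the sketch below uses the standard Pearson correlation coefficient, with hypothetical pre-test and post-test scores, to show the sign and range described above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores measured on the same five students.
pre_test = [60, 70, 80, 90, 100]
post_test = [68, 72, 85, 88, 97]

r_positive = pearson_r(pre_test, post_test)                 # goes up together
r_negative = pearson_r(pre_test, list(reversed(post_test)))  # one up, one down
r_perfect = pearson_r(pre_test, [2 * x + 5 for x in pre_test])  # exact line

print(f"{r_positive:.2f} {r_negative:.2f} {r_perfect:.2f}")
```

Both variables must be measured on the same individuals, so the two lists are paired by position.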

Page 14: Correlation

[Figure: Scatterplot of Post-Test vs. Pre-Test scores, r = .85]

Page 15: Correlation Cautions

Page 16: Correlation Cautions

• Correlation only measures linear associations:

[Figure: Scatterplot of y vs. x, with x from -3 to 3; r = 0]
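A small numerical illustration of this caution: the sketch below assumes a perfect quadratic relationship, y = x², on x values symmetric around zero (the slides do not state which curve the figure uses). y is completely determined by x, yet the Pearson correlation is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Assumed example: a U-shaped (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r_curved = pearson_r(x, y)
print(f"r = {r_curved:.2f}")
```

The positive and negative halves of the curve cancel out, so r reports no linear association even though the relationship is strong.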

Page 17: Outliers

• Correlation can be highly affected by outliers:

[Figures: Scatterplot of y vs. x with an outlier marked, r = .78; Scatterplot of y2 vs. x2, r = .17]
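The effect can be reproduced with a few made-up points. In the sketch below, the data are hypothetical (not the points plotted on the slide): a weakly related cluster plus one distant point, with the correlation computed with and without that point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cluster with only a weak relationship.
cluster_x = [75, 78, 80, 82, 85]
cluster_y = [80, 74, 86, 72, 83]

# One hypothetical outlier, far from the cluster on both axes.
outlier_x, outlier_y = 35, 42

r_without = pearson_r(cluster_x, cluster_y)
r_with = pearson_r(cluster_x + [outlier_x], cluster_y + [outlier_y])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

A single distant point is enough to turn a weak correlation (roughly 0.1 here) into a strong one (above 0.9), which is why the scatterplot should be checked before r is reported.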

Page 18: Outliers

• Mean and standard deviation are also highly influenced by outliers:

• Mean = 3, Standard Deviation = 0.21:

• Mean = 3.8, Standard Deviation = 1.86:

[Figures: two plots of the data; an outlier is marked in the second]
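The same with-and-without comparison works for the mean and standard deviation. The values below are hypothetical and only loosely echo the slide's numbers. The median is printed alongside to show that it moves far less than the mean.

```python
import statistics

# Hypothetical measurements clustered around 3.
without_outlier = [2.8, 2.9, 3.0, 3.1, 3.2]
with_outlier = without_outlier + [8.0]  # one extreme value added

def describe(values):
    """Mean, sample standard deviation, and median of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }

summary_without = describe(without_outlier)
summary_with = describe(with_outlier)

for label, s in (("without", summary_without), ("with", summary_with)):
    print(f"{label:>7}: mean={s['mean']:.2f}, sd={s['sd']:.2f}, "
          f"median={s['median']:.2f}")
```

Running an analysis both ways like this shows directly how much a single point influences the results.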

Page 19: Outliers

• ALWAYS plot your data to see if you have outliers

• If you do have extreme outliers, you should first check that you haven’t made an error entering the data

• If the outliers are legitimate, you should run your analyses with and without the outliers to see how much the outliers influence the results