descriptive - stat.duke.edukfl5/Lock_RREE_Descriptive_2009.pdf · Descriptive Statistics • Before...
Descriptive Statistics
Kari Lock
Harvard University
Department of Statistics
Rigorous Research in Engineering Education
12/3/09
Descriptive Statistics
• Before making inferences about the population, we need to describe the data observed in our sample
• Visual displays: graphs, plots
• Summary statistics: numerical summaries of the data
• The appropriate descriptive techniques depend on the type of data collected
Visualization: One Categorical Variable
[Figures: Pie Chart; Bar Chart]
• Goal: Display the number or percent of people in each category
Visualization: One Quantitative Variable
[Figures: Boxplot of Exam Score, with the 25th percentile, 50th percentile (median), 75th percentile, and outliers marked; Histogram of Exam Score (x-axis: Exam Score, y-axis: Frequency)]
• Goal: Visually display the entire distribution of observed numbers. Location, spread, shape, and outliers are all visible.
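The percentiles marked on the boxplot can also be computed directly. A minimal sketch, assuming made-up exam scores and Python's default "exclusive" quartile method (other software may use slightly different conventions and give slightly different values):

```python
import statistics

# Hypothetical exam scores (made up for illustration).
scores = [60, 70, 75, 80, 85, 90, 100]

# quantiles(..., n=4) returns the three cut points that split the data into
# quarters: the 25th, 50th (median), and 75th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)

print("25th percentile:", q1)
print("50th percentile (median):", q2)
print("75th percentile:", q3)
```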
Visualization: One Quantitative Variable
• Symmetric:
• Skewed:
• Outlier – values that fall outside an overall pattern.
Distribution
• If a histogram is theoretically known, it is smoothed over with a distribution. This distribution represents what the hypothetical histogram would look like if we were to take many samples.
[Figures: two histograms (x-axis from -3 to 3, y-axis: Frequency)]
Visualization: One Categorical, One Quantitative
[Figures: Side-by-Side Boxplots of Exam Scores by Gender (Males, Females); Side-by-Side Boxplots of Exam Scores by Experimental Group (Control, Treatment)]
• Compare the numerical distributions of two (or more) different groups
Visualization: One Categorical, Two Quantitative
[Figure: Side-by-Side Boxplots of Exam Scores by Experimental Group (Control, Treatment)]
• Often it's not obvious whether two groups are significantly different. That's what statistics is for!
Visualization: Two Quantitative
[Figure: Scatterplot of Post-Test vs. Pre-Test scores, with one outlier marked]
• Goal: Visualize the relationship between two quantitative variables. Note: Both variables have to be measured on the same set of individuals.
Summary Statistics: Categorical Variable
• Proportion: Report the proportion of observations that fall in each category.
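As a rough illustration of reporting proportions, here is a minimal sketch. The category labels and data are hypothetical, invented only for this example:

```python
from collections import Counter

# Hypothetical categorical sample (made up for illustration).
majors = ["Mechanical", "Civil", "Mechanical", "Electrical",
          "Civil", "Mechanical", "Electrical", "Mechanical"]

counts = Counter(majors)  # number of observations in each category
n = len(majors)           # sample size
proportions = {category: count / n for category, count in counts.items()}

for category, p in proportions.items():
    print(f"{category}: {counts[category]} of {n} ({p:.1%})")
```

The proportions always sum to 1, which is a quick check that no observation was dropped.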
Summary Statistics: One Quantitative Variable
Mean – the "average": add everything up and divide by the sample size.
Median – the "middle": the point at which 50% of the data are below and 50% are above.
Standard Deviation – a measure of the average distance from the mean. Indicates how "spread out" the data are.
Variance = (Standard Deviation)²
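These summaries can be sketched in a few lines. The scores below are hypothetical, and the sketch assumes the sample versions of the standard deviation and variance (dividing by n - 1), which the slide does not specify:

```python
import statistics

# Hypothetical exam scores (made up for illustration).
scores = [60, 70, 75, 80, 85, 90, 100]

mean = statistics.mean(scores)          # add everything up, divide by sample size
median = statistics.median(scores)      # middle value of the sorted data
sd = statistics.stdev(scores)           # sample standard deviation (n - 1)
variance = statistics.variance(scores)  # sample variance (n - 1)

print(f"mean = {mean}, median = {median}")
print(f"standard deviation = {sd:.2f}, variance = {variance}")
```

Squaring the standard deviation recovers the variance, up to floating-point rounding.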
Summary Statistics: One Quantitative Variable
[Figures: two distributions with the same mean]
Lower standard deviation (0.33):
Higher standard deviation (1.32):
Summary Statistics: Two Quantitative Variables
• Correlation (r): a measure of linear association between two numeric variables.
• Correlation runs between -1 and 1. Values closer to -1 and 1 represent stronger relationships; values closer to 0 are weaker.
– Positive Correlation: as one variable goes up, the other goes up
– Negative Correlation: as one variable goes up, the other goes down
– Zero Correlation means no linear association
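A minimal sketch of how r can be computed, using the standard Pearson formula. The helper name `pearson_r` and the pre-test and post-test scores are hypothetical, chosen only to show a positive and a negative correlation:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores measured on the same five individuals.
pre = [60, 70, 80, 90, 100]
post_up = [65, 72, 85, 88, 98]    # tends to rise with pre
post_down = [98, 88, 85, 72, 65]  # tends to fall as pre rises

r_up = pearson_r(pre, post_up)
r_down = pearson_r(pre, post_down)
print(f"positive correlation: r = {r_up:.2f}")
print(f"negative correlation: r = {r_down:.2f}")
```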
Correlation
[Figure: Scatterplot of Post-Test vs. Pre-Test scores, r = .85]
Correlation Cautions
• Correlation only measures linear associations:
[Figure: scatterplot of y vs. x (x-axis from -3 to 3), r = 0]
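To see how a perfect but nonlinear relationship can give r = 0, here is a small sketch. The data (y = x² on a symmetric range of x) are an assumption chosen for illustration, not the data behind the slide's figure:

```python
import math

# Hypothetical data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Standard Pearson correlation formula.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

print(f"r = {r:.2f}")  # the positive and negative products cancel out
```

A scatterplot would reveal the curve immediately, which is one reason to plot the data before relying on r.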
Outliers
• Correlation can be highly affected by outliers:
[Figures: scatterplot of y vs. x with an outlier marked, r = .78; scatterplot of y2 vs. x2, r = .17]
Outliers
• Mean and standard deviation are also highly influenced by outliers:
• Mean = 3, Standard Deviation = 0.21:
• Mean = 3.8, Standard Deviation = 1.86:
[Figures: the same data without and with an outlier marked]
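The effect is easy to reproduce. A minimal sketch with made-up values (not the data behind the numbers above), using the sample standard deviation:

```python
import statistics

# Hypothetical measurements (made up for illustration).
values = [2.8, 2.9, 3.0, 3.1, 3.2]
with_outlier = values + [8.0]  # the same data plus one extreme value

for label, data in [("without outlier", values), ("with outlier", with_outlier)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1)
    print(f"{label}: mean = {mean:.2f}, standard deviation = {sd:.2f}")
```

A single extreme value pulls the mean toward itself and inflates the standard deviation.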
Outliers
• ALWAYS plot your data to see if you have outliers
• If you do have extreme outliers, you should first check that you haven't made an error entering the data
• If the outliers are legitimate, you should run your analyses with and without the outliers to see how much the outliers influence the results
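Running an analysis with and without an outlier might look like the sketch below. The data, the helper `pearson_r`, and the choice of which point counts as the outlier are all hypothetical; in practice the outlier would be identified from a plot:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: seven typical points plus one extreme point.
x = [75, 77, 78, 80, 82, 84, 85]
y = [80, 74, 86, 78, 88, 76, 84]
outlier = (35, 45)  # assumed to have been flagged by looking at a scatterplot

r_without = pearson_r(x, y)
r_with = pearson_r(x + [outlier[0]], y + [outlier[1]])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

Reporting both values shows how much of the apparent association depends on one point.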