Introduction to Statistics: Political Science (Class 9) Review.
Introduction to Statistics: Political Science (Class 9)
Review
Probability of having cardiovascular disease
• Purpose of statistics:
  – Inferences about populations using samples
• We draw a random sample of 1,000 adults and 405 have some form of CVD
• Based on our sample, if we randomly select one adult from the population: what is the probability that they have cardiovascular disease?
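The question above can be answered with the sample proportion, which serves as the estimate of the population probability. A minimal sketch (the variable names are illustrative and do not come from the slides):

```python
# Estimate P(CVD) for a randomly selected adult using the sample proportion.
sample_size = 1000   # adults in the random sample (from the slide)
cvd_count = 405      # adults in the sample with some form of CVD (from the slide)

p_cvd = cvd_count / sample_size
print(f"Estimated P(CVD) = {p_cvd:.3f}")
```

The estimate is 405/1,000 = 0.405, or about a 40.5% chance.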
Conditional Probability
                                          No CVD   CVD
Exercise less than 3 days/week (N=602)    30.3%    28.9%
Exercise 3 or more days/week (N=398)      30.2%    10.6%
• Probability of exercising <3 days/week?
• Probability of CVD among those who exercise <3 days/week?
• Probability of CVD among those who exercise 3 or more days/week?
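The three questions can be worked through from the cell percentages in the table. The sketch below treats each cell as a share of the whole sample and divides by the row total to get a conditional probability; the function and key names are illustrative only. Note that the row total computed from the cells (30.3 + 28.9 = 59.2%) differs slightly from the 60.2% implied by N=602; the sketch follows the cell percentages.

```python
# Joint percentages from the table: share of the whole sample in each cell.
joint = {
    ("<3 days", "no CVD"): 30.3,
    ("<3 days", "CVD"): 28.9,
    ("3+ days", "no CVD"): 30.2,
    ("3+ days", "CVD"): 10.6,
}

def marginal(exercise):
    """Percent of the sample in one exercise group (the row total)."""
    return sum(v for (ex, _), v in joint.items() if ex == exercise)

def p_cvd_given(exercise):
    """P(CVD | exercise group) = joint share / row total."""
    return joint[(exercise, "CVD")] / marginal(exercise)

p_low_exercise = marginal("<3 days") / 100   # P(exercise < 3 days/week)
p1 = p_cvd_given("<3 days")                  # P(CVD | < 3 days/week)
p2 = p_cvd_given("3+ days")                  # P(CVD | 3 or more days/week)

print(f"{p_low_exercise:.3f} {p1:.3f} {p2:.3f}")
```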
Association between exercise and CVD?
                                          No CVD   CVD
Exercise less than 3 days/week (N=602)    30.3%    28.9%
Exercise 3 or more days/week (N=398)      30.2%    10.6%
p1 = 28.9/(30.3+28.9) = 0.488
p2 = 10.6/(30.2+10.6) = 0.260
Difference = 0.488 - 0.260 = .228
Those who exercise less than 3 days/week are .228 (22.8 percentage points) more likely to have CVD
Specifying and testing hypotheses
• Difference of proportions = .228
• What’s our null hypothesis?
• Why a “null hypothesis”? Why not test whether the difference is .228?
• Central limit theorem
  – In repeated sampling, our estimates of the mean (or difference of means or slope) will be approximately normally distributed and centered over the true population value
Central limit theorem
[Figure: normal curve centered on the proposed true value (0), with a distance of 1 standard error marked]
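The repeated-sampling idea behind the central limit theorem can be illustrated with a small simulation. This is only a sketch: the true proportion (0.4), the sample size, and the number of repeated samples are assumptions chosen for illustration, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_P = 0.4    # hypothetical true population proportion (an assumption)
N = 1000        # size of each random sample
DRAWS = 2000    # number of repeated samples

def sample_proportion(p, n):
    """Proportion of 'successes' in one random sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n

estimates = [sample_proportion(TRUE_P, N) for _ in range(DRAWS)]

mean_est = statistics.mean(estimates)            # should sit close to TRUE_P
sd_est = statistics.stdev(estimates)             # empirical standard error
theory_se = (TRUE_P * (1 - TRUE_P) / N) ** 0.5   # sqrt(p(1-p)/n)

# Share of estimates within 2 standard errors of the true value
# (roughly 95% if the estimates are approximately normal).
within_2se = sum(abs(e - TRUE_P) <= 2 * theory_se for e in estimates) / DRAWS

print(mean_est, sd_est, theory_se, within_2se)
```

The estimates cluster around the true value, and their spread matches the theoretical standard error. That is why an observed estimate many standard errors away from a proposed true value counts as evidence against that value.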
Comparing proportions
• Difference of proportions = .228
p1 = 28.9/(30.3+28.9) = 0.488 (N=602)
p2 = 10.6/(30.2+10.6) = 0.260 (N=398)
• Standard error of this difference:
  SE = sqrt( p1*(1-p1)/N1 + p2*(1-p2)/N2 )
Comparing proportions
• So, standard error of difference is the square root of: (.488*(1-.488)/602)+(.260*(1-.260)/398)
  – Which is .0299
• Difference of proportions = .228
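The standard error calculation described above can be sketched as a small function. The function name is illustrative; the inputs are the proportions and group sizes given on the slide.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Values from the slide: p1 = 0.488 (N=602), p2 = 0.260 (N=398)
se = se_diff_proportions(0.488, 602, 0.260, 398)
print(f"SE of difference = {se:.4f}")
```

The result is about 0.03, in line with the .0299 reported on the slide.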
Hypotheses
• Null hypothesis:
  – There is no difference in the rate of CVD between those who exercise less than 3 days/week and those who exercise 3 or more days/week
• Alternate hypothesis:
  – There is a difference in the rate of CVD between those who exercise less than 3 days/week and those who exercise 3 or more days/week
  – (i.e., the difference is not 0)
If 0 were the true difference, it would be very unlikely that we would find a difference 7.63 (.228/.0299) standard errors from that value by chance
[Figure: normal curve centered on the proposed true value (0), with a distance of 1 standard error marked]
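The test can be sketched numerically. The slides do not compute a p-value; the two-sided p-value from the standard normal distribution is added here as an illustration. The difference is taken from p1 - p2 = 0.488 - 0.260, and the standard error is the .0299 reported on the slides.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

diff = 0.488 - 0.260   # observed difference of proportions
se = 0.0299            # standard error of the difference (from the slides)
null_value = 0.0       # difference proposed by the null hypothesis

z = (diff - null_value) / se    # distance from the null in standard errors
p_value = two_sided_p_value(z)

print(f"z = {z:.2f}, p = {p_value:.2e}")
```

A difference more than 7 standard errors from 0 has a p-value that is effectively zero, so the null hypothesis of no difference is rejected.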
Does exercise cause lower CVD?
• Reverse causation? Might CVD cause exercise?
• Failure to account for confounds
  – Typically leads to over-estimating the strength of a relationship (not always… but usually)
[Figure: plot with Bush FT on the horizontal axis (0-100) and Obama FT on the vertical axis (0-100); legend: Democrats, Republicans]
Specification and Interpretation
Multivariate Regression
Does exercise make CVD less likely?
• Regression (predict CVD)
• Estimated likelihood of CVD if exercise 4 days/week?
• What might confound our estimate of the relationship between exercise and CVD?
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.06   .001   ?      0.000
Constant                0.56   .002   ?      0.000
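The predicted value asked about above comes from plugging the number of exercise days into the estimated equation. A minimal sketch using the coefficients in the table (the function name is illustrative):

```python
# Coefficients from the regression table (predicting CVD).
CONSTANT = 0.56
COEF_EXERCISE = -0.06   # change in predicted CVD per extra day of exercise

def predict_cvd(days_exercise):
    """Predicted probability of CVD for a given number of exercise days."""
    return CONSTANT + COEF_EXERCISE * days_exercise

p_four_days = predict_cvd(4)
print(f"Predicted P(CVD) at 4 days/week = {p_four_days:.2f}")
```

For 4 days/week this is 0.56 - 0.06*4 = 0.32.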
Controlling for confounds
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.04   .002    2.0   0.048
Constant                0.42   .002   21.0   0.000
[Figure: % chance of CVD (vertical axis) by days per week of exercise (horizontal axis), with separate lines for high fast food and low fast food]
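The two lines in a plot like this can be generated from the regression with the fast food control. The slides do not say which fast food values count as "high" and "low", so the values below (6 and 1 days/week) are assumptions for illustration.

```python
# Coefficients from the regression that controls for fast food.
CONSTANT = 0.42
COEF_EXERCISE = -0.03
COEF_FAST_FOOD = 0.04

LOW_FAST_FOOD = 1    # hypothetical "low" value in days/week (an assumption)
HIGH_FAST_FOOD = 6   # hypothetical "high" value in days/week (an assumption)

def predict_cvd(days_exercise, days_fast_food):
    """Predicted probability of CVD from the two-variable regression."""
    return (CONSTANT
            + COEF_EXERCISE * days_exercise
            + COEF_FAST_FOOD * days_fast_food)

# One predicted value per exercise level (0 through 7 days/week).
low_line = [predict_cvd(d, LOW_FAST_FOOD) for d in range(8)]
high_line = [predict_cvd(d, HIGH_FAST_FOOD) for d in range(8)]

# The vertical gap between the lines is the same at every exercise level.
gap = [h - l for h, l in zip(high_line, low_line)]
print(gap)
```

Because the model has no interaction term, the two lines are parallel. Fast food shifts the predicted chance of CVD up or down, while the slope on exercise stays at -0.03.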
Controlling for dichotomous confounds
• Predicted probability of CVD for
  – 2 days exercise, 2 days fast food, smoker
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.04   .002    2.0   0.048
Smoker (1=yes)          0.11   .001   11.0   0.000
Constant                0.38   .002   19.0   0.000
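The predicted probability for this profile can be worked out by multiplying each coefficient by the corresponding value and adding the constant. A minimal sketch (the names are illustrative):

```python
# Coefficients from the regression table with the smoker indicator.
COEFS = {
    "constant": 0.38,
    "days_exercise": -0.03,
    "days_fast_food": 0.04,
    "smoker": 0.11,   # indicator: 1 = smoker, 0 = non-smoker
}

def predict_cvd(days_exercise, days_fast_food, smoker):
    """Predicted probability of CVD for one profile."""
    return (COEFS["constant"]
            + COEFS["days_exercise"] * days_exercise
            + COEFS["days_fast_food"] * days_fast_food
            + COEFS["smoker"] * smoker)

p_smoker = predict_cvd(2, 2, 1)
p_nonsmoker = predict_cvd(2, 2, 0)
print(f"{p_smoker:.2f} {p_nonsmoker:.2f}")
```

For 2 days of exercise, 2 days of fast food, and a smoker: 0.38 - 0.06 + 0.08 + 0.11 = 0.51. The same profile for a non-smoker gives 0.40, so the gap equals the smoker coefficient.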
Nominal Variables
• Variable that does not have an “order” to it
  – Nothing is “higher” or “lower”
• Create set of dichotomous variables
• Always interpret coefficients with respect to the reference category
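Creating the set of dichotomous variables can be sketched as follows. The four regions and the choice of Midwest as the reference category follow the regression table on the next slide; the function name is illustrative.

```python
REGIONS = ["Midwest", "South", "West", "Northeast"]
REFERENCE = "Midwest"   # excluded (reference) category

def region_indicators(region):
    """Return 0/1 indicators, one for each non-reference region."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {r: int(region == r) for r in REGIONS if r != REFERENCE}

print(region_indicators("South"))    # South indicator is 1, the others 0
print(region_indicators("Midwest"))  # all indicators are 0
```

A respondent in the reference category has every indicator set to 0, which is why each region coefficient is read as a difference from the reference category.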
Controlling for nominal confounds
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.03   .002    1.5   0.135
Smoker (1=yes)          0.09   .001    9.0   0.000
South (1=yes)           0.03   .002    1.5   0.137
West (1=yes)           -0.01   .002   -0.5   0.642
Northeast (1=yes)       0.02   .002    1.0   0.410
Constant                0.34   .002   17.0   0.000
(Midwest is excluded category)
What if we wanted to test whether including region indicators improves fit of the model?
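One standard way to answer this question is an F-test comparing the model without the region indicators (restricted) to the model with them (unrestricted). The slides do not show this calculation or report the sums of squared residuals, so the numbers below are hypothetical and only illustrate the arithmetic.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, n, k):
    """F statistic for jointly testing q added coefficients.

    ssr_restricted:   sum of squared residuals without the added variables
    ssr_unrestricted: sum of squared residuals with the added variables
    q: number of added variables being tested
    n: number of observations
    k: number of predictors in the unrestricted model (excluding the constant)
    """
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k - 1)
    return numerator / denominator

# Hypothetical inputs for illustration only (not from the slides):
# 3 region indicators added to a model that ends up with 6 predictors.
f_stat = f_statistic(ssr_restricted=230.0, ssr_unrestricted=229.0,
                     q=3, n=1000, k=6)
print(f"F = {f_stat:.2f}")
```

The statistic is compared with a critical value from the F distribution with q and n - k - 1 degrees of freedom (roughly 2.6 at the 5% level for 3 added variables and a large sample). A value below the critical value means the indicators do not significantly improve the fit.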
Non-linear relationships
[Figure: home value ($s, 0 to 1,800,000) on the vertical axis plotted against yearly income ($s, 60,000 to 6,660,000) on the horizontal axis]
Logarithms
Why use a logarithmic transformation?
You think the relationship looks like this…
[Figure: home value (0 to 1,800,000) on the vertical axis plotted against logged yearly income (10 to 16) on the horizontal axis]
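The effect of logging income can be sketched numerically. The income values below are taken from the income axis of the first plot; the use of the natural log is an assumption, chosen because it is consistent with the 10 to 16 range on the logged axis.

```python
import math

# Income values from the axis of the unlogged plot.
incomes = [60_000, 660_000, 1_260_000, 6_660_000]
logged = [math.log(x) for x in incomes]   # natural log (an assumption)

for income, value in zip(incomes, logged):
    print(f"{income:>9,d} -> {value:.2f}")

# Equal ratios of income become equal distances on the log scale:
# a tenfold increase always adds ln(10), whatever the starting point.
step_low = math.log(600_000) - math.log(60_000)
step_high = math.log(6_000_000) - math.log(600_000)
print(step_low, step_high)
```

The transformation compresses the very large incomes, so a relationship in which home value rises quickly at low incomes and flattens at high incomes looks closer to a straight line against logged income.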
Logarithms
Squared term – U (or ∩)-shaped relationship
Coef. SE T P
Age -0.007 0.004 -1.740 0.082
Constant 0.122 0.209 0.580 0.561
Coef. SE T P
Age -0.065 0.025 -2.630 0.009
Age-squared 0.001 0.000 2.390 0.017
Constant 1.554 0.635 2.450 0.015
Age and political ideology (-2=very conservative, 2=very liberal)
Age and Political Ideology
               Coef.    SE      T        P
Age            -0.065   0.025   -2.630   0.009
Age-squared     0.001   0.000    2.390   0.017
Constant        1.554   0.635    2.450   0.015

Age   Age^2   -0.065*Age   .0005574*Age^2   Constant   Predicted Value
18    324     -1.178       0.181            1.554       0.557
28    784     -1.832       0.437            1.554       0.159
38    1444    -2.487       0.805            1.554      -0.128
48    2304    -3.141       1.284            1.554      -0.303
58    3364    -3.795       1.875            1.554      -0.366
68    4624    -4.450       2.577            1.554      -0.319
78    6084    -5.104       3.391            1.554      -0.159
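The predicted values in the table, and the age at which the curve turns, can be sketched from the coefficients. The sketch uses the rounded age coefficient (-0.065) and the age-squared coefficient shown in the table header (.0005574), so its output differs slightly from the table, which appears to use an unrounded age coefficient.

```python
# Coefficients from the model with a squared term.
B_AGE = -0.065
B_AGE_SQ = 0.0005574   # value shown in the table header
CONSTANT = 1.554

def predict_ideology(age):
    """Predicted ideology (-2 = very conservative, 2 = very liberal)."""
    return CONSTANT + B_AGE * age + B_AGE_SQ * age ** 2

# A quadratic a + b*x + c*x^2 reaches its minimum or maximum at x = -b / (2c).
turning_point = -B_AGE / (2 * B_AGE_SQ)

for age in range(18, 79, 10):
    print(age, round(predict_ideology(age), 3))
print(f"Turning point at about age {turning_point:.1f}")
```

Predicted ideology falls with age until the late 50s and then rises again, which is the U-shape that the linear model cannot capture.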
[Figure: predicted ideology (-2=very conservative, 2=very liberal; axis shown from -1 to 1) on the vertical axis by age (18 to 88) on the horizontal axis]
Create indicators from an ordered variable
Party Identification (-3 to 3)
Seven Variables:
• Strong Republican (1=yes)
• Weak Republican (1=yes)
• Lean Republican (1=yes)
• Pure Independent (1=yes)
• Lean Democrat (1=yes)
• Weak Democrat (1=yes)
• Strong Democrat (1=yes)
Predict Obama Favorability (1-4)
Coef. SE T P
Strong Republican -1.632 0.161 -10.160 0.000
Weak Republican -0.707 0.198 -3.580 0.000
Lean Republican -1.235 0.181 -6.810 0.000
Lean Democrat 0.674 0.197 3.430 0.001
Weak Democrat 0.494 0.187 2.640 0.009
Strong Democrat 0.595 0.159 3.750 0.000
Constant 2.940 0.134 21.870 0.000
Excluded category: Pure Independents
[Figure: Obama favorability (1 to 4) on the vertical axis by party identification category, from Strong Republican to Strong Democrat]
Predict Obama Favorability (1-4)
Coef. SE T P
Strong Republican -0.397 0.150 -2.650 0.008
Weak Republican 0.528 0.189 2.790 0.006
Pure Independent 1.235 0.181 6.810 0.000
Lean Democrat 1.909 0.188 10.150 0.000
Weak Democrat 1.729 0.179 9.680 0.000
Strong Democrat 1.831 0.148 12.360 0.000
Constant 1.705 0.122 14.010 0.000
New excluded category: Leaning Republicans
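Changing the excluded category changes the coefficients but not the predicted values. The sketch below checks this with the two tables above: each group's prediction is the constant plus that group's coefficient, and the excluded group's prediction is the constant alone. The data structure and names are illustrative.

```python
CATEGORIES = [
    "Strong Republican", "Weak Republican", "Lean Republican",
    "Pure Independent", "Lean Democrat", "Weak Democrat", "Strong Democrat",
]

# Model with Pure Independents excluded (first table).
model_a = {
    "constant": 2.940,
    "coefs": {
        "Strong Republican": -1.632, "Weak Republican": -0.707,
        "Lean Republican": -1.235, "Lean Democrat": 0.674,
        "Weak Democrat": 0.494, "Strong Democrat": 0.595,
    },
}

# Model with Leaning Republicans excluded (second table).
model_b = {
    "constant": 1.705,
    "coefs": {
        "Strong Republican": -0.397, "Weak Republican": 0.528,
        "Pure Independent": 1.235, "Lean Democrat": 1.909,
        "Weak Democrat": 1.729, "Strong Democrat": 1.831,
    },
}

def predicted(model, category):
    """Constant plus the category's coefficient (0 for the excluded group)."""
    return model["constant"] + model["coefs"].get(category, 0.0)

for cat in CATEGORIES:
    print(f"{cat:18s} {predicted(model_a, cat):.3f} {predicted(model_b, cat):.3f}")
```

The two columns agree up to rounding. For example, Strong Republicans are predicted at 2.940 - 1.632 = 1.308 in the first model and 1.705 - 0.397 = 1.308 in the second.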
Interactions
• One variable moderates the effect of another
  – i.e., the relationship between one variable and an outcome depends on the value of another variable
Coef. SE T P
Party Affiliation (-3=strong R; 3=strong D) 1.286 0.878 1.460 0.143
Voted in 2008 -1.138 1.484 -0.770 0.443
Party Affiliation x Voted in 2008 3.575 0.918 3.900 0.000
Constant 61.100 1.358 44.980 0.000
Regression estimates an equation…
61.100 + 1.286*Party – 1.138*Voted + 3.575*Party*Voted + u

61.100 + Party*1.286 + Party*Voted*3.575 – 1.138*Voted + u
61.100 + Party(1.286 + Voted*3.575) – 1.138*Voted + u
OR
61.100 + Party*1.286 + Voted*Party*3.575 – Voted*1.138 + u
61.100 + Party*1.286 + Voted(Party*3.575 – 1.138) + u
Did not vote (Voted = 0)
Party Aff.   Voted   Party Aff.   Voted      Party x Voted   Constant   Predicted Value
Coefficients         1.286        -1.138     3.575           61.100
-3           0       -3.858       0          0               61.100     57.242
-2           0       -2.572       0          0               61.100     58.528
-1           0       -1.286       0          0               61.100     59.814
 0           0        0.000       0          0               61.100     61.100
 1           0        1.286       0          0               61.100     62.386
 2           0        2.572       0          0               61.100     63.672
 3           0        3.858       0          0               61.100     64.959

Voted (Voted = 1)
Party Aff.   Voted   Party Aff.   Voted      Party x Voted   Constant   Predicted Value
Coefficients         1.286        -1.138     3.575           61.100
-3           1       -3.858       -1.13775   -10.7258        61.100     45.378
-2           1       -2.572       -1.13775   -7.1505         61.100     50.240
-1           1       -1.286       -1.13775   -3.57525        61.100     55.101
 0           1        0.000       -1.13775   0               61.100     59.962
 1           1        1.286       -1.13775   3.575252        61.100     64.824
 2           1        2.572       -1.13775   7.150504        61.100     69.685
 3           1        3.858       -1.13775   10.72576        61.100     74.547
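The predicted values in these tables, and the way voting changes the slope on party affiliation, can be sketched from the estimated equation. The function names are illustrative. The sketch uses the rounded coefficients, so results can differ from the table in the third decimal place.

```python
# Coefficients from the interaction model.
CONSTANT = 61.100
B_PARTY = 1.286          # party affiliation (-3 = strong R, 3 = strong D)
B_VOTED = -1.138         # voted in 2008 (1 = yes)
B_INTERACTION = 3.575    # party affiliation x voted

def predict_support(party, voted):
    """Predicted outcome for a party affiliation score and voting status."""
    return (CONSTANT
            + B_PARTY * party
            + B_VOTED * voted
            + B_INTERACTION * party * voted)

def party_slope(voted):
    """Effect of a one-unit change in party affiliation, given voting status."""
    return B_PARTY + B_INTERACTION * voted

for voted in (0, 1):
    row = [round(predict_support(p, voted), 3) for p in range(-3, 4)]
    print(f"voted={voted}: slope={party_slope(voted):.3f}, predictions={row}")
```

Among non-voters each step toward strong Democrat adds 1.286 to the prediction. Among voters it adds 1.286 + 3.575 = 4.861, which is the sense in which voting moderates the relationship.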
[Figure: support for comparative effectiveness research (vertical axis, 40 to 80) by party affiliation (Strong Republican to Strong Democrat), with separate lines for those who did not vote and those who voted]
Establishing causality
Dealing with confounds
• Theory + multivariate regression
• Experiments
Dealing with reverse causation
• Theory
• Experiments
Experiments
• What is the key characteristic of an experiment?
• How does this address reverse causality?
• How does it address confounds?
• Weaknesses/limitations of experiments?
Exam Expectations
• Describe probabilities / conditional probabilities
• Write hypotheses
  – Demonstrate understanding of how null hypotheses relate to the central limit theorem
• Test difference of proportions (formula for SE will be provided)
• Interpreting multivariate regression
  – Relationships (slopes)
  – Predicted values
  – Sketch graphs of relationships
• Discuss strengths and limitations of analyses
  – Why an estimated slope might be biased
  – Benefits and limitations of experiments
Notes
• Homework 3 graded
• Homework 4 due Thursday 12/9
• Office hours next week – email to come
• Exam December 14 at 2pm