Introduction to Biostatistics for Clinical and Translational Researchers
KUMC Departments of Biostatistics & Internal Medicine
University of Kansas Cancer Center
FRONTIERS: The Heartland Institute of Clinical and Translational Research
Course Information
Jo A. Wick, PhD
Office Location: 5028 Robinson
Email: [email protected]
Lectures are recorded and posted at http://biostatistics.kumc.edu under ‘Events and Opportunities’
Inferences: Hypothesis Testing
Experiment
An experiment is a process whose results are not known until after it has been performed. The range of possible outcomes is known in advance; we do not know the exact outcome, but would like to know the chances of its occurrence. The probability of an outcome E, denoted P(E), is a numerical measure of the chances of E occurring: 0 ≤ P(E) ≤ 1.
Probability
The most common definition of probability is the relative frequency view:
P(x = a) = (# of times x = a) / (total # of observations of x)
Probabilities for the outcomes of a random variable x are represented through a probability distribution:
[Figure: two example probability distributions: a discrete distribution of hospital length of stay, with the bar at 6 days highlighting the probability that length of stay = 6 days, and a continuous bell-shaped density curve.]
Population Parameters
Most often our research questions involve unknown population parameters:
What is the average BMI among 5th graders?
What proportion of hospital patients acquire a hospital-based infection?
To determine these values exactly would require a census.
However, due to a prohibitively large population (or other considerations) a sample is taken instead.
Sample Statistics
Statistics describe or summarize sample observations.
They vary from sample to sample, making them random variables.
We use statistics generated from samples to make inferences about the parameters that describe populations.
Sampling Variability
[Figure: a population with mean μ and standard deviation σ; each of several samples drawn from it yields its own x̄ and s, and together the sample means form the sampling distribution of x̄.]
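This is easy to see by simulation. A minimal sketch (the population parameters and sample size here are arbitrary illustrative choices):

```python
# Simulate the sampling distribution of the mean: draw many samples from the
# same population and record each sample's mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 25   # illustrative population parameters and sample size

means = [rng.normal(mu, sigma, n).mean() for _ in range(1000)]
print(f"mean of the sample means = {np.mean(means):.3f}")  # close to mu
print(f"SD of the sample means   = {np.std(means):.3f}")   # close to sigma/sqrt(n) = 0.2
```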
Recall: Hypotheses
Null hypothesis “H0”: statement of no difference or association between variables. This is the hypothesis we test; the first step in the ‘recipe’ for hypothesis testing is to assume H0 is true.
Alternative hypothesis “H1”: statement of a difference or association between variables. This is what we are (usually) trying to prove.
Hypothesis Testing
One-tailed hypothesis: the outcome is expected in a single direction (e.g., administration of the experimental drug will result in a decrease in systolic BP); H1 includes ‘<’ or ‘>’.
Two-tailed hypothesis: the direction of the effect is unknown (e.g., the experimental therapy will result in a different response rate than that of the current standard of care); H1 includes ‘≠’.
Hypothesis Testing
The statistical hypotheses are statements concerning characteristics of the population(s) of interest:
Population mean: μ
Population variability: σ
Population rate (or proportion): π
Population correlation: ρ
Example: It is hypothesized that the response rate for the experimental therapy is greater than that of the current standard of care.πExp > πSOC ← This is H1.
Recall: Decisions
Type I Error (α): a true H0 is incorrectly rejected. “An innocent man is proven GUILTY in a court of law.” Commonly accepted rate is α = 0.05.
Type II Error (β): failing to reject a false H0. “A guilty man is proven NOT GUILTY in a court of law.” Commonly accepted rate is β = 0.2.
Power (1 – β): correctly rejecting a false H0. “Justice has been served.” Commonly accepted rate is 1 – β = 0.8.
Decisions
Conclusion \ Truth | H1 | H0
Conclude H1 | Correct (Power) | Type I Error
Conclude H0 | Type II Error | Correct
Basic Recipe for Hypothesis Testing
1. State H0 and H1
2. Assume H0 is true ← Fundamental assumption!!
3. Collect the evidence—from the sample data, compute the appropriate sample statistic and the test statistic
Test statistics quantify the level of evidence within the sample—they also provide us with the information for computing a p-value (e.g., t, chi-square, F)
4. Determine if the test statistic is large enough to meet the a priori determined level of evidence necessary to reject H0 (. . . or, is p < α?)
Example: Carbon Monoxide
An experiment is undertaken to determine the concentration of carbon monoxide in air.
It is a concern that the actual concentration is significantly greater than 10 mg/m3.
Eighteen air samples are obtained and the concentration for each sample is measured. The outcome x is the carbon monoxide concentration in the samples. The characteristic (parameter) of interest is μ, the true average concentration of carbon monoxide in air.
Step 1: State H0 & H1
H1: μ > 10 mg/m3 ← We suspect!
H0: μ ≤ 10 mg/m3 ← We assume in order to test!
Step 2: Assume μ = 10
[Figure: the null sampling distribution of x̄, centered at μ = 10.]
Step 3: Evidence
10.25 10.37 10.66
10.47 10.56 10.22
10.44 10.38 10.63
10.40 10.39 10.26
10.32 10.35 10.54
10.33 10.48 10.68
Sample statistic: x̄ = 10.43
Test statistic: t = (x̄ − μ0) / (s/√n) = (10.43 − 10) / (1.02/√18) = 1.79
What does 1.79 mean? How do we use it?
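One way to see where 1.79 comes from: a minimal sketch of the computation in Python, using the summary statistics quoted above (x̄ = 10.43, s = 1.02, n = 18):

```python
# One-sample, one-tailed t-test for the carbon monoxide example, computed
# from the slide's summary statistics.
import math
from scipy import stats

x_bar, s, n, mu0 = 10.43, 1.02, 18, 10.0

t = (x_bar - mu0) / (s / math.sqrt(n))  # test statistic
p = stats.t.sf(t, df=n - 1)             # upper-tail p-value, P(T >= t)
print(f"t = {t:.2f}, p = {p:.4f}")      # t ≈ 1.79, p ≈ 0.046
```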
Student’s t Distribution
Remember when we assumed H0 was true?
[Figure: the null sampling distribution again, centered at μ = 10 (Step 2: assume μ = 10).]
Student’s t Distribution
What we were actually doing was setting up this theoretical Student’s t distribution from which the p-value can be calculated:
[Figure: the theoretical Student’s t distribution, centered at t = 0; under H0, t = (x̄ − μ0) / (s/√n) = (10 − 10) / (1.02/√18) = 0.]
Student’s t Distribution
Assuming the true air concentration of carbon monoxide is actually 10 mg/m3, how likely is it that we should get evidence in the form of a sample mean equal to 10.43?
[Figure: the null distribution centered at μ = 10, with the observed x̄ = 10.43 marked. P(x̄ ≥ 10.43) = ?]
Student’s t Distribution
We can say how likely by framing the statement in terms of the probability of an outcome:
[Figure: the Student’s t distribution centered at t = 0, with the observed value t = 1.79 marked in the upper tail.]
t = (x̄ − μ0) / (s/√n) = (10.43 − 10) / (1.02/√18) = 1.79
p = P(t ≥ 1.79) = 0.0456
Step 4: Make a Decision
Decision rule: if p ≤ α, the chances of getting the actual collected evidence from our sample, given that the null hypothesis is true, are very small. The observed data conflict with the null ‘theory’ and support the alternative ‘theory.’ Since the evidence (data) was actually observed and our theory (H0) is unobservable, we choose to believe that our evidence is the more accurate portrayal of reality and reject H0 in favor of H1.
Step 4: Make a Decision
What if our evidence had not been in as great a degree of conflict with our theory? If p > α, the chances of getting the actual collected evidence from our sample, given that the null hypothesis is true, are pretty high.
We fail to reject H0.
[Figure: the null distribution centered at 10, with x̄ = 10.1 marked near the center. P(x̄ ≥ 10.1) = ?]
Decision
How do we know if the decision we made was the correct one? We don’t! If α = 0.05, the chances of our decision being an incorrect rejection of a true H0 are no greater than 5%. We have no way of knowing whether we made this kind of error; we only know that our chances of making it in this setting are relatively small.
Which test do I use?
What kind of outcome do you have? Nominal? Ordinal? Interval? Ratio?
How many samples do you have? Are they related or independent?
Types of Tests
One Sample

Measurement Level | Population Parameter | Hypotheses | Sample Statistic | Inferential Method(s)
Nominal | Proportion π | H0: π = π0 vs. H1: π ≠ π0 | p = x/n | Binomial test or z test (if np > 10 and nq > 10)
Ordinal | Median M | H0: M = M0 vs. H1: M ≠ M0 | m = p50 | Wilcoxon signed-rank test
Interval | Mean μ | H0: μ = μ0 vs. H1: μ ≠ μ0 | x̄ | Student’s t or Wilcoxon (if non-normal or small n)
Ratio | Mean μ | H0: μ = μ0 vs. H1: μ ≠ μ0 | x̄ | Student’s t or Wilcoxon (if non-normal or small n)
Types of Tests
Parametric methods: make assumptions about the distribution of the data (e.g., normally distributed) and are suited for sample sizes large enough to assess whether the distributional assumption is met.
Nonparametric methods: make no assumptions about the distribution of the data and are suitable for small sample sizes or large samples where parametric assumptions are violated. They use the ranks of the data values rather than the actual data values themselves, at the cost of some power when the parametric test is appropriate.
Types of Tests
Two Independent Samples

Measurement Level | Population Parameters | Hypotheses | Sample Statistics | Inferential Method(s)
Nominal | π1, π2 | H0: π1 = π2 vs. H1: π1 ≠ π2 | p1 = x1/n1, p2 = x2/n2 | Fisher’s exact or chi-square (if cell counts > 5)
Ordinal | M1, M2 | H0: M1 = M2 vs. H1: M1 ≠ M2 | m1, m2 | Median test
Interval | μ1, μ2 | H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 | x̄1, x̄2 | Student’s t or Mann-Whitney (if non-normal, unequal variances, or small n)
Ratio | μ1, μ2 | H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 | x̄1, x̄2 | Student’s t or Mann-Whitney (if non-normal, unequal variances, or small n)
# Groups = 2
  Normal or large n
    Independent samples: 2-sample t
    Dependent samples: Paired t
  Non-normal or small n
    Independent samples: Wilcoxon rank-sum (Mann-Whitney)
    Dependent samples: Wilcoxon signed-rank
# Groups > 2
  Normal or large n
    Independent samples: ANOVA
    Dependent samples: 2-way ANOVA
  Non-normal or small n
    Independent samples: Kruskal-Wallis
    Dependent samples: Friedman’s
Comparing Central Tendency
Two-Sample Test of Means
Clotting times (minutes) of blood for subjects given one of two different drugs:
It is hypothesized that the two drugs will result in different blood-clotting times.
H1: μB ≠ μG
H0: μB = μG
Drug B: 8.8, 8.4, 7.9, 8.7, 9.1, 9.6 (x̄1 = 8.75)
Drug G: 9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5 (x̄2 = 9.74)
Two-Sample Test of Means
What we’re actually hypothesizing: H0: μB - μG = 0
[Figure: the null distribution of x̄1 − x̄2, centered at 0. The observed difference x̄1 − x̄2 = −0.99 is the evidence: what are P(x̄1 − x̄2 ≤ −0.99) and P(x̄1 − x̄2 ≥ +0.99)?]
Two-Sample Test of Means
What we’re actually hypothesizing: H0: μB - μG = 0
[Figure: the t distribution centered at t = 0, with the observed value marked in both tails at t = −2.48 and t = +2.48.]
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = (8.75 − 9.74) / 0.40 = −2.475
p = P(|t| ≥ 2.475) = 0.03
***Two-sided tests detect ANY evidence in EITHER direction that the null difference is unlikely!
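A minimal sketch of this two-sample test with SciPy (the pooled-variance version, which reproduces the slide’s rounded figures):

```python
# Two-sample (pooled-variance) t-test on the clotting-time data.
from scipy import stats

drug_b = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]
drug_g = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5]

# Pass equal_var=False instead for Welch's test when variances look unequal.
t, p = stats.ttest_ind(drug_b, drug_g)
print(f"t = {t:.3f}, p = {p:.3f}")  # t ≈ -2.48, p ≈ 0.03 (two-sided)
```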
Assumptions of t
In order to use the parametric Student’s t test, we have a few assumptions that need to be met:Approximate normality of the observationsIn the case of two samples, approximate equality of the
sample variances
Assumption Checking
To assess the assumption of normality, a simple histogram would show any issues with skewness or outliers:
Assumption Checking
[Figure: histograms illustrating skewness.]
Assumption Checking
Other graphical assessments include the QQ plot:
Assumption Checking
Violation of normality:
Assumption Checking
To assess the assumption of equal variances (when groups = 2), simple boxplots would show any issues with heteroscedasticity:
Assumption Checking
Rule of thumb: if the larger variance is more than 2 times the smaller, the assumption has been violated
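A minimal sketch of these checks in Python, reusing the clotting-time data as the illustration (the values are pooled here for brevity; with larger samples you would check each group separately):

```python
# Assumption checks: histogram and QQ plot for normality, plus the 2x
# variance rule of thumb for two groups.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

drug_b = np.array([8.8, 8.4, 7.9, 8.7, 9.1, 9.6])
drug_g = np.array([9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5])
pooled = np.concatenate([drug_b, drug_g])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(pooled)                  # histogram: look for skewness, outliers
stats.probplot(pooled, plot=axes[1])  # QQ plot: look for departures from the line
plt.show()

ratio = max(drug_b.var(ddof=1), drug_g.var(ddof=1)) / \
        min(drug_b.var(ddof=1), drug_g.var(ddof=1))
print(f"variance ratio = {ratio:.2f}")  # rule of thumb: > 2 suggests a violation
```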
Now what?
If you have enough observations (20? 30?) to be able to determine that the assumptions are feasible, check them.If violated:
• Try a transformation to correct the violated assumptions (natural log) and reassess; proceed with the t-test if fixed
• If a transformation doesn’t work, proceed with a non-parametric test• Skip the transformation altogether and proceed to the non-
parametric test
If okay, proceed with t-test.
Now what?
If you have too small a sample to adequately assess the assumptions, perform the non-parametric test instead.For the one-sample t, we typically substitute the Wilcoxon
signed-rank testFor the two-sample t, we typically substitute the Mann-
Whitney test
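A minimal sketch of both substitutes, reusing earlier data for illustration:

```python
# Nonparametric substitutes: Mann-Whitney for two independent samples,
# Wilcoxon signed-rank for the one-sample (or paired) case.
from scipy import stats

drug_b = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]
drug_g = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5]

u, p = stats.mannwhitneyu(drug_b, drug_g)   # substitute for the two-sample t
print(f"Mann-Whitney U = {u}, p = {p:.3f}")

co = [10.25, 10.37, 10.66, 10.47, 10.56, 10.22]  # a few CO values, tested vs. mu0 = 10
w, p = stats.wilcoxon([x - 10 for x in co])      # substitute for the one-sample t
print(f"Wilcoxon signed-rank W = {w}, p = {p:.3f}")
```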
Consequences of Nonparametric Testing
Robust! But less powerful, because they are based on ranks, which do not contain the full level of information contained in the raw data.
When in doubt, use the nonparametric test—it will be less likely to give you a ‘false positive’ result.
Speaking of Power
“How many subjects do we need?” Statistical methods can be used to determine the required number of patients to meet the trial’s principal scientific objectives.
Other considerations that must be accounted for include the availability of patients and resources and the ethical need to prevent any patient from receiving inferior treatment. We want the minimum number of patients required to achieve our principal scientific objective.
The Size of a Clinical Trial
For the chosen level of significance (type I error rate, α), a clinically meaningful difference (Δ) between two groups can be detected with a minimally acceptable power (1 – β) with n subjects.
Example: Detecting a Difference
Primary objective: To compare pain improvement in knee OA for new treatment A compared to standard treatment S.
Primary outcome: Change in pain score from baseline to 24 weeks (continuous).
Data analysis: Comparison of mean change in pain score of patients on treatment A (μ1) versus standard (μ2) using a two-sided t-test at the α = 0.05 level of significance.
Example: Detecting a Difference
Difference to detect (Δ): It has been determined that a difference of 10 on this pain scale is clinically meaningful. If standard therapy results in a 5-point decrease, our new therapy would need to show a decrease of at least 15 (5 + 10) to be declared clinically different from the standard.
We would like to be 80% sure that we detect this difference as statistically significant.
Example: Detecting a Difference
What usually occurs on the standard? This is important information because it tells us about the behavior of the outcome (pain scale) in these patients. If the pain scale has great variability, it may be difficult to detect small to moderate changes (signal-to-noise)!
‘Signal-to-Noise’
[Figure: two plots of change in pain from baseline for groups S and A; each shows a between-group difference of 20, but with very different amounts of within-group variability.]
Example: Detecting a Difference
We have: H0: μ1 = μ2 versus H1: μ1 ≠ μ2 (Δ ≠ 0); α = 0.05; 1 – β = 0.80; Δ = 10.
For continuous outcomes we need to determine what difference would be clinically meaningful, but specified in the form of an effect size which takes into account the variability of the data.
Example: Detecting a Difference
Effect size is the difference in the means divided by the standard deviation, usually of the control or comparison group, or the pooled standard deviation of the two groups
d = (μ1 − μ2) / σ
where, in the pooled case, σ = √{[(n1 − 1)σ1² + (n2 − 1)σ2²] / (n1 + n2 − 2)}
Example: Detecting a Difference
Interactive web-based power calculators can show the relationship between power and the sample size, variability, and difference to detect.
A decrease in the variability of the data results in an increase in power for a given sample size.
An increase in the effect size results in a decrease in the required sample size to achieve a given power.
Decreasing α results in an increase in the required sample size to achieve a given power.
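A minimal sketch of the sample-size calculation for the knee OA example, assuming (hypothetically) a standard deviation of 20 points on the pain scale, so that the effect size is d = 10/20 = 0.5:

```python
# Sample size for a two-sided, two-sample t-test with alpha = 0.05 and
# power = 0.80; sigma = 20 is an assumed value for illustration only.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,          # d = delta / sigma = 10 / 20
    alpha=0.05,               # two-sided type I error rate
    power=0.80,               # 1 - beta
    alternative="two-sided",
)
print(f"n per group = {n_per_group:.1f}")  # about 64 per group
```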
Inferences on Two Means
Example: Smoking cessation
Two types of therapy: x = {behavioral therapy, literature}
Dependent variable: y = % decrease in number of cigarettes smoked per day after six months of therapy
Behavioral Therapy Literature Only
10 6
20 2
65 0
0 12
30 4
Smoking Cessation
Research question: Is behavioral therapy in addition to education better than education alone in getting smokers to quit?
H0: μ1 = μ2 versus H1: μ1 ≠ μ2
Two independent samples t-test IF:
the change is approximately normal OR can be transformed to an approximate normal distribution (e.g., natural log)
the variability within each group is approximately the same (ROT: no more than a 2x difference)
Smoking Cessation
Conclusion: Adding behavioral therapy to cessation education results in—on average—a greater reduction in cigarettes smoked per day at six months post-therapy when compared to education alone (t30.9 = -2.87, p < 0.01).
Reject H0: μ1 = μ2
Smoking Cessation
The 95% confidence interval is:
−8.39 ≤ μ1 − μ2 ≤ −1.42
Interpretation: On average, behavioral therapy resulted in an additional reduction of 4.9% (95% CI: 1.42%, 8.39%) relative to control.
Confidence Intervals
What exactly do confidence intervals represent? Remember that theoretical sampling distribution concept? It doesn’t actually exist; it’s only mathematical. What would we see if we took sample after sample after sample and did the same test on each . . .
Confidence Intervals
Suppose we actually took sample after sample . . . 100 of them, to be exact
Every time we take a different sample and compute the confidence interval, we will likely get a slightly different result simply due to sampling variability.
Confidence Intervals
Suppose we actually took sample after sample . . . 100 of them, to be exact
95% confident means: “In 95 of the 100 samples, our interval will contain the true unknown value of the parameter. However, in 5 of the 100 it will not.”
Confidence Intervals
Suppose we actually took sample after sample . . . 100 of them, to be exact
Our “confidence” is in the procedure that produces the interval—i.e., it performs well most of the time.
Our “confidence” is not directly related to our particular interval—we cannot say “The probability that the mean difference is between (1.4,8.4) is 0.95.”
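A minimal simulation of the repeated-sampling idea: draw 100 samples, build a 95% confidence interval from each, and count how many cover the true mean (the population used here is an arbitrary illustrative choice):

```python
# Coverage of 95% t-based confidence intervals over 100 repeated samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 1.0, 30
t_crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(100):
    sample = rng.normal(mu, sigma, n)
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    if x_bar - t_crit * se <= mu <= x_bar + t_crit * se:
        covered += 1
print(f"{covered} of 100 intervals contain the true mean")  # about 95
```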
Inferences on More Than Two Means
Example: Smoking cessation
Three types of therapy: x = {pharmaceutical therapy, behavioral therapy, literature}
Dependent variable: y = % decrease in number of cigarettes smoked per day after six months of therapy
Pharmaceutical Therapy Behavioral Therapy Literature Only
10 10 6
30 0 20
60 6 0
32 0 12
65 30 4
Smoking Cessation
Research question: Is therapy in addition to education better than education alone in getting smokers to quit? If so, is one therapy more effective?
H0: μ1 = μ2 = μ3 versus H1: at least one μ is different.
More than 2 independent samples requires an ANOVA:
the change is approximately normal OR can be transformed to an approximate normal distribution (e.g., natural log)
the variability within each group is approximately the same (ROT: no more than a 2x difference)
Smoking Cessation
ANOVA produces a table:
One-way ANOVA indicates you have a single categorical factor x (e.g., treatment) and a single continuous response y and your interest is in comparing the mean response μ across the levels of the categorical factor.
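A minimal sketch of the overall F test on the five observations per group shown above (the slide’s own table and reported p-values may come from a larger dataset):

```python
# One-way ANOVA on the three-group smoking-cessation data.
from scipy import stats

pharm = [10, 30, 60, 32, 65]
behav = [10, 0, 6, 0, 30]
lit   = [6, 20, 0, 12, 4]

f, p = stats.f_oneway(pharm, behav, lit)
print(f"F = {f:.2f}, p = {p:.3f}")  # F is MS_between / MS_within
```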
Wait . . .
Why is ANOVA using variances when we’re hypothesizing about means?
Between-groups mean square: a variance
Within-groups mean square: also a variance
F: a ratio of variances, F = MSBG/MSWG
What’s the Rationale?
In the simplest case of the one-way ANOVA, the variation in the response y is broken down into parts: variation in response attributed to the treatment (group/sample) and variation in response attributed to error (subject characteristics + everything else not controlled for). The variation in the treatment (group/sample) means is compared to the variation within a treatment (group/sample) using a ratio: this is the F test statistic!
If the between-treatment variation is a lot bigger than the within-treatment variation, that suggests there are some different effects among the treatments.
Rationale
[Figure: boxplots of three scenarios, labeled 1, 2, and 3.]
Rationale
There is an obvious difference between scenarios 1 and 2. What is it?
Just looking at the boxplots, which of the two scenarios (1 or 2) do you think would provide more evidence that at least one of the populations is different from the others? Why?
F Statistic
Case A: If all the sample means were exactly the same, what would be the value of the numerator of the F statistic?
Case B: If all the sample means were spread out and very different, how would the variation between sample means compare to the value in A?
F = (variation between the sample means) / (natural variation within the samples)
F Statistic
So what values could the F statistic take on? Could you get an F that is negative? What type of values of F would lead you to believe the null hypothesis (that there is no difference in group means) is not accurate?
F = (variation between the sample means) / (natural variation within the samples)
Smoking Cessation
ANOVA produces a table:
Conclusion: Reject H0: μ1 = μ2 = μ3. Some difference in the number of cigarettes smoked per day exists between subjects receiving the three types of therapy.
Smoking Cessation
ANOVA produces a table:
But where is the difference? Are the two experimental therapies different? Or is each different from the control?
Reject H0: μ1 = μ3 and H0: μ2 = μ3. Both pharmaceutical and behavioral therapy are significantly different from the literature-only control group, but the two therapies are not different from each other.
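The slide does not name the post-hoc procedure used; one standard choice is Tukey’s HSD, sketched here on the illustrative data (results for these five observations per group need not match the slide’s reported p-values):

```python
# Post-hoc pairwise comparisons with Tukey's HSD after a significant ANOVA.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = [10, 30, 60, 32, 65,   # pharmaceutical therapy
          10, 0, 6, 0, 30,      # behavioral therapy
          6, 20, 0, 12, 4]      # literature only
groups = ["pharm"] * 5 + ["behav"] * 5 + ["lit"] * 5

print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```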
Smoking Cessation
Conclusion: Adding either behavioral (p = 0.015) or pharmaceutical therapy (p < 0.01) to cessation education results in—on average—significantly greater decreases in cigarettes smoked per day at six months post-therapy when compared to education alone.
Inferences on Means
Concerns a continuous response y. One or two groups: t. More than two groups: ANOVA.
Remember, this (and the two-sample case) is essentially looking at the association between an x and a y, where x is categorical (nominal or ordinal) and y is continuous (interval or ratio).
Check assumptions! Normality of y; equal group variances.
ANOVA Models
There are many . . .
I. Randomized designs with one treatment
  A. Subjects not subdivided on any basis other than randomization prior to assignment to treatment levels; no restriction on random assignment other than the option of assigning the same number of subjects to each treatment level
    1. Completely randomized (one-factor) design
  B. Subjects subdivided on some nonrandom basis, or one or more restrictions on random assignment other than assigning the same number of subjects to each treatment level
    1. Balanced incomplete block design
    2. Crossover design
    3. Generalized randomized block design
    4. Graeco-Latin square design
    5. Hyper-Graeco-Latin square design
    6. Latin square design
    7. Partially balanced incomplete block design
    8. Randomized block design
    9. Youden square design
II. Randomized designs with two or more treatments
  A. Factorial experiments: designs in which all treatment levels are crossed
    1. Designs without confounding
      a. Completely randomized factorial design
      b. Generalized randomized factorial design
      c. Randomized block factorial design
    2. Designs with group-treatment confounding
      a. Split-plot factorial design
    3. Designs with group-interaction confounding
      a. Latin square confounded factorial design
      b. Randomized block completely confounded factorial design
Inferences on Proportions (k = 2)
Example: plant genetics
Two phenotypes: x = {yellow-flowered plants, green-flowered plants}
Dependent variable: y = proportion of plants out of 100 progeny that express each phenotype, estimated by p = x/n
What the data look like in the dataset (one row per plant): Phenotype = Yellow, Yellow, Green, Yellow, Green, . . .
Plant Genetics
The plant geneticist hypothesizes that his crossed progeny will result in a 3:1 phenotypic ratio of yellow-flowered to green-flowered plants.
H0: The population contains 75% yellow-flowered plants versus H1: The population does not contain 75% yellow-flowered plants.
H0: πy = 0.75 versus H1: πy ≠ 0.75
This particular type of test is referred to as the chi-square goodness of fit test for k = 2.
Plant Genetics
Chi-square statistics compute deviations between what is expected (under H0) and what is actually observed in the data:
χ² = Σ (O − E)² / E, with DF = k − 1, where k is the number of categories of x
Plant Genetics
Suppose the researcher actually observed in his sample of 100 plants this breakdown of phenotype:
Does it appear that this type of sample could have come from a population where the true proportion of yellow-flowered plants is 75%?
Phenotype f (%)
Yellow-flowered 84 (84%)
Green-flowered 16 (16%)
Plant Genetics
Conclusion: Reject H0: πy = 0.75—it does not appear that the geneticist’s hypothesis about the population phenotypic ratio is correct (p = 0.038).
Phenotype f (%)
Yellow-flowered 84 (84%)
Green-flowered 16 (16%)
χ²₁ = (84 − 75)²/75 + (16 − 25)²/25 = 4.32
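A minimal sketch of this goodness-of-fit test in SciPy:

```python
# Chi-square goodness-of-fit test for the 3:1 phenotypic ratio (k = 2).
from scipy import stats

observed = [84, 16]
expected = [75, 25]   # 75% and 25% of the 100 plants under H0

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 4.32, p ≈ 0.038
```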
Inferences on Proportions (k > 2)
Example: plant genetics
Four phenotypes: x = {yellow-smooth flowered, yellow-wrinkled flowered, green-smooth flowered, green-wrinkled flowered}
Dependent variable: y = proportion of plants out of 250 progeny that express each phenotype
What the data look like in the dataset (one row per plant): Phenotype = Yellow smooth, Yellow smooth, Green wrinkled, Yellow wrinkled, . . .; each proportion is estimated by p = x/n.
Plant Genetics
The plant geneticist hypothesizes that his crossed progeny will result in a 9:3:3:1 phenotypic ratio of YS:YW:GS:GW plants.
Actual numeric hypothesis is H0: π1 = 0.5625, π2 = 0.1875, π3 = 0.1875, π4 = 0.0625
This particular type of test is referred to as the chi-square goodness of fit test for k = 4.
Plant Genetics
Chi-square statistics compute deviations between what is expected (under H0) and what is actually observed in the data:
χ² = Σ (O − E)² / E, with DF = k − 1, where k is the number of categories of x
Plant Genetics
Suppose the researcher actually observed in his sample of 250 plants this breakdown of phenotype:
Does it appear that this type of sample could have come from a population where the true phenotypic ratio is as the geneticist hypothesized?
Phenotype f (%)
YS 152 (60.8%)
YW 39 (15.6%)
GS 53 (21.2%)
GW 6 (2.4%)
Plant Genetics
Conclusion: Reject H0—it does not appear that the geneticist’s hypothesis about the population phenotypic ratio is correct (p = 0.03).
Phenotype f (%)
YS 152 (60.8%)
YW 39 (15.6%)
GS 53 (21.2%)
GW 6 (2.4%)
χ²₃ = 8.97
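The same computation for k = 4, with the expected counts derived from the 9:3:3:1 ratio:

```python
# Chi-square goodness-of-fit test for the 9:3:3:1 phenotypic ratio (k = 4).
from scipy import stats

observed = [152, 39, 53, 6]                       # YS, YW, GS, GW
expected = [250 * r / 16 for r in (9, 3, 3, 1)]   # 140.625, 46.875, 46.875, 15.625

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")          # chi2 ≈ 8.97, p ≈ 0.030
```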
Inferences on Proportions
Concerns a categorical response y. Regardless of the number of groups, a chi-square test may be used. Remember, this is essentially looking at the association between an x and a y, where x is categorical (nominal or ordinal) and y is categorical (nominal or ordinal).
Assumptions? ROT: no expected frequency should be less than 5 (i.e., each nπ ≥ 5). If not met, use the binomial (k = 2) or multinomial (k > 2) test.
Inferences on Proportions
What do we do when we have nominal data on more than one factor x? Gender and hair color; menopausal status and disease stage at diagnosis; ‘handedness’ and gender.
We still use chi-square! These types of tests are looking at whether two categorical variables are independent of one another; thus, tests of this type are often referred to as chi-square tests of independence.
Inferences on Proportions
Example: Hair color and gender
Gender: x1 = {M, F}
Hair color: x2 = {Black, Brown, Blonde, Red}
Black Brown Blonde Red Total
Male 32 (32%) 43 (43%) 16 (16%) 9 (9%) 100
Female 55 (27.5%) 65 (32.5%) 64 (32%) 16 (8%) 200
Total 87 108 80 25 N = 300
What the data should look like in the actual dataset (one row per subject):
Gender | Hair Color
Male | Black
Female | Red
Female | Blonde
. . .
Hair Color and Gender
The researcher hypothesizes that hair color is not independent of sex.
H0: Hair color is independent of gender (i.e., the phenotypic ratio is the same within each gender).
H1: Hair color is not independent of gender (i.e., the phenotypic ratio is different between genders).
Hair Color and Gender
Chi-square statistics compute deviations between what is expected (under H0) and what is actually observed in the data:
χ² = Σ (O − E)² / E, with DF = (r − 1)(c − 1), where r is the number of rows and c is the number of columns
Hair Color and Gender
Does it appear that this type of sample could have come from a population where the different hair colors occur with the same frequency within each gender?
OR does it appear that the distribution of hair color is different between men and women?
Black Brown Blonde Red Total
Male 32 (32%) 43 (43%) 16 (16%) 9 (9%) 100
Female 55 (27.5%) 65 (32.5%) 64 (32%) 16 (8%) 200
Total 87 108 80 25 N = 300
Hair Color and Gender
Conclusion: Reject H0: Gender and Hair Color are independent. It appears that the researcher’s hypothesis that the population phenotypic ratio is different between genders is correct (p = 0.029).
Black Brown Blonde Red Total
Male 32 (32%) 43 (43%) 16 (16%) 9 (9%) 100
Female 55 (27.5%) 65 (32.5%) 64 (32%) 16 (8%) 200
Total 87 108 80 25 N = 300
Critical value (α = 0.05, 3 DF): χ² = 7.815
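A minimal sketch of the test of independence on the observed counts:

```python
# Chi-square test of independence for gender and hair color.
import numpy as np
from scipy import stats

table = np.array([[32, 43, 16, 9],    # males:   black, brown, blonde, red
                  [55, 65, 64, 16]])  # females: black, brown, blonde, red

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")  # df = 3, p ≈ 0.029
```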
Inferences on Proportions
Special case: when you have a 2X2 contingency table, you are actually testing a hypothesis concerning two population proportions: H0: π1 = π2
(i.e., the proportion of males who are blonde is the same as the proportion of females who are blonde).
Blonde Non-blonde Total
Male 16 (16%) 84 (84%) 100
Female 64 (32%) 136 (68%) 200
Total 80 (26.7%) 220 (73.3%) N = 300
Inferences on Proportions
When you have a single proportion and have a small sample, substitute the Binomial test which provides exact results.
The nonparametric Fisher exact test can always be used in place of the chi-square test when you have contingency-table-like data (i.e., two categorical factors whose association is of interest); it should be substituted for the chi-square test of independence when ‘cell’ sizes are small.
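A minimal sketch of Fisher’s exact test on the 2×2 blonde/non-blonde table above:

```python
# Fisher's exact test: H0 is that the proportion of blondes is the same
# for males and females.
from scipy import stats

table = [[16, 84],    # males:   blonde, non-blonde
         [64, 136]]   # females: blonde, non-blonde

odds_ratio, p = stats.fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p:.4f}")  # a small p suggests the proportions differ
```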
Next Time
Linear Regression and Correlation
Survival Analysis
Final Thoughts