Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see...
Transcript of Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see...
![Page 1: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/1.jpg)
Hypothesis Testing
![Page 2: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/2.jpg)
2
Select 50% users to see headline A◦ Titanic Sinks
Select 50% users to see headline B◦ Ship Sinks Killing Thousands
Do people click more on headline A or B?
The New York Times Daily Dilemma
![Page 3: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/3.jpg)
3
Two Populations
Testing Hypotheses
12
?
?
109
?
?
4
?
Which one has the largest average?
?
?
![Page 4: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/4.jpg)
4
The two-sample t-test
Is difference in averages between two groups more than we would expect based on chance alone?
![Page 5: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/5.jpg)
5
More Broadly: Hypothesis Testing Procedures
Hypothesis Testing
Procedures
Parametric
Z Test t Test Cohen's d
Nonparametric
Wilcoxon Rank Sum
Test
Kruskal-Walli H-
Test
Kolmogorov-Smirnov
test
![Page 6: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/6.jpg)
6
Parametric Test Procedures
Tests Population Parameters (e.g. Mean)
Distribution Assumptions (e.g. Normal distribution)
Examples: Z Test, t-Test, 2 Test, F test
![Page 7: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/7.jpg)
7
Nonparametric Test Procedures
Not Related to Population ParametersExample: Probability Distributions, Independence
Data Values not Directly UsedUses Ordering of Data
Examples: Wilcoxon Rank Sum Test , Komogorov-Smirnov Test
![Page 8: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/8.jpg)
8
In class experiment with R
Left= c(20,5,500,15,30) Right = c(0,50,70,100)
t.test(Left, Right, alternative=c("two.sided","less","greater"), var.equal=TRUE, conf.level=0.95)
Two-sample t-Test
![Page 9: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/9.jpg)
t-Test (Independent Samples)
H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0
The goal is to evaluate if the average difference between two populations is zero
The t-test makes the following assumptions
• The values in X(0) and X(1) follow a normal distribution
• Observations are independent
Two hypotheses:
9
![Page 10: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/10.jpg)
General t formula
t = sample statistic - hypothesized population parameter estimated standard error
Independent samples t
Empirical averages
Estimated standard deviation??
t-Test Calculation
10
![Page 11: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/11.jpg)
Standard deviation of difference in empirical averages
t-Test: Standard Deviation Calculation
How much variance when we use average difference of observations to represent the true average difference?
11
![Page 12: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/12.jpg)
t-Test: Standard Deviation Calculation (2/2)
Standard deviation of difference in empirical averages
with degrees of freedom
Also known as Welsh’s t
Sample variance of X(0)
Number of observationsin X(0)
12
![Page 13: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/13.jpg)
13
What is the p-value?
Can we ever accept hypothesis H1 ?
t-Statistics p-valueH0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0
![Page 14: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/14.jpg)
14
Effect Size
![Page 15: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/15.jpg)
t-Test tests only if the difference is zero or not.
What about effect size?
Cohen’s d
where s is the pooled variance
t-Test: Effect Size
15
![Page 16: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/16.jpg)
16
Bayesian Approach
![Page 17: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/17.jpg)
17
Probability of hypothesis given data
The Bayes factor
Bayesian Approach
![Page 18: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/18.jpg)
18
Example of Nonparametric Test
![Page 19: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/19.jpg)
19
Two-sample Kolmogorov-Smirnov Test◦ Do X(0) and X(1) come from same underlying distribution?
◦ Hypothesis (same distribution)rejected at level p if
Nonparametric Testing of Distributions
Wikipedia
Em
pir
ical
Confidence interval factor
Sample size correction
The K-S test is less sensitive when the differences between curves is greatest at the beginning or the end of the distributions. Works best when distributions differ at center.
Good reading: M. Tygert, Statistical tests for whether a given set of independent, identically distributed draws comes from a specified probability density. PNAS 2010
![Page 20: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/20.jpg)
20
Are Two User Features Independent?
![Page 21: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/21.jpg)
Twitter users can have gender and number of tweets.
We want to determine whether gender is related to number of tweets.
Use chi-square test for independence
Chi-Squared Test
21
![Page 22: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/22.jpg)
When to use chi-square test for independence:◦ Uniform sampling design◦ Categorical features◦ Population is significantly larger than sample
State the hypotheses:◦ H0 ?
◦ H1 ?
When to use Chi-Squared test
22
![Page 23: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/23.jpg)
men = c(300, 100, 40)
women = c(350, 200, 90)
data = as.data.frame(rbind(men, women))
names(data) = c('low', 'med', 'large')
data
chisq.test(data)
Reject H0 (p<0.05) means …
Example Chi-Squared Test
23
![Page 24: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/24.jpg)
24
Deciding Headlines
![Page 25: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/25.jpg)
25
Select 50% users to see headline A◦ Titanic Sinks
Select 50% users to see headline B◦ Ship Sinks Killing Thousands
Assign half the readers to headline A and half to headline B?◦ Yes?◦ No?◦ Which test to use?
What happens A is MUCH better than B?
Revisiting The New York Times Dilemma
![Page 26: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/26.jpg)
26
How to stop experiment early if hypothesis seems true◦ Stopping criteria often needs to be decided before
experiment starts◦ If ever needed:
Sequential Analysis (Sequential Hypothesis Test)
![Page 27: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/27.jpg)
27
But there is a better way…
![Page 28: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/28.jpg)
28
K distinct hypotheses (so far we had K = 2)◦ Hypothesis = choosing NYT headline
Each time we pull arm i we get reward Xi
(simple version of problem)
Underlying population (reward distribution) does not change over time
Bandit algorithms attempt to minimize regret◦ If n = total actions ; ni = total actions i
Bandit Algorithms
largest true average
![Page 29: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/29.jpg)
29
Note that regret is defined over the true average reward
How can we estimate true average reward Xi?◦ We need to get lots of observations from population i◦ But what happens if E[Xi] is small?
Core of decision making problems: ◦ Exploration vs. exploitation◦ When exploring we seek to improve estimated average reward ◦ When exploiting we try what has worked better in the past
Balancing exploration and exploitation: ◦ Instead of trying the action with highest estimated average, we
try the action with the highest upper bound on its confidence interval(more on this next class)
Challenge
![Page 30: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/30.jpg)
30
Multi-Armed Bandit (MAB)◦ Bandit process is a special type of Markov Decision
Process◦ Generally, reward Xi(ni) at ni–th arm pull of arm i is P[Xi(ni) |
Xi(ni-1)]
UCB1◦ Use arm i that maximizes
UCB1 (Upper Confidence Bound 1)
![Page 31: Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see headline B ◦ Ship Sinks Killing Thousands Do people.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f265503460f94c3ce7b/html5/thumbnails/31.jpg)
31
R ExamplenumT <- 2500 # number of time stepsttest <- c()# mean of population 1mean1 = 0.4# mean of population 2mean2 = 0.7# initialize observationsx1 <- c(rbinom(n=1,size=1,prob=mean1))x2 <- c(rbinom(n=1,size=1,prob=mean2))n1 = 1n2 = 1for (i in 2:numT){ # compute reward of bandit 1 reward_1 = mean(x1) + sqrt(2*log(i)/n1) # compute reward of bandit 2 reward_2 = mean(x2) + sqrt(2*log(i)/n2) # decides which arm to pull if (reward_1 > reward_2) { x1 <- c(rbinom(n=1,size=1,prob=mean1),x1) n1 = n1 + 1 } else { x2 <- c(rbinom(n=1,size=1,prob=mean2),x2) n2 = n2 + 1 } # computes the t-Test p-value of observations if ((n1 > 2) && (n2 > 2)) { d <- t.test(x1,x2)$p.value } else { d = 1 } ttest <- c(ttest, d)}par(mfrow=c(1,2))plot(2:numT,ttest,"l",xlab="Time",ylab="t-Test pvalue",log="y")barplot(c(n1,n2),xlab="Arm Pulls",ylab="Observations",names.arg=c("n0", "n1"))