Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see...

Hypothesis Testing

2

Select 50% users to see headline A◦ Titanic Sinks

Select 50% users to see headline B◦ Ship Sinks Killing Thousands

Do people click more on headline A or B?

The New York Times Daily Dilemma

3

Two Populations

Testing Hypotheses

12

?

?

109

?

?

4

?

Which one has the largest average?

?

?

4

The two-sample t-test

Is difference in averages between two groups more than we would expect based on chance alone?

5

More Broadly: Hypothesis Testing Procedures

Hypothesis Testing

Procedures

Parametric

Z Test t Test Cohen's d

Nonparametric

Wilcoxon Rank Sum

Test

Kruskal-Walli H-

Test

Kolmogorov-Smirnov

test

6

Parametric Test Procedures

Tests Population Parameters (e.g. Mean)

Distribution Assumptions (e.g. Normal distribution)

Examples: Z Test, t-Test, 2 Test, F test

7

Nonparametric Test Procedures

Not Related to Population ParametersExample: Probability Distributions, Independence

Data Values not Directly UsedUses Ordering of Data

Examples: Wilcoxon Rank Sum Test , Komogorov-Smirnov Test

8

In class experiment with R

Left= c(20,5,500,15,30) Right = c(0,50,70,100)

t.test(Left, Right, alternative=c("two.sided","less","greater"), var.equal=TRUE, conf.level=0.95)

Two-sample t-Test

t-Test (Independent Samples)

H0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

The goal is to evaluate if the average difference between two populations is zero

The t-test makes the following assumptions

• The values in X(0) and X(1) follow a normal distribution

• Observations are independent

Two hypotheses:

9

General t formula

t = sample statistic - hypothesized population parameter estimated standard error

Independent samples t

Empirical averages

Estimated standard deviation??

t-Test Calculation

10

Standard deviation of difference in empirical averages

t-Test: Standard Deviation Calculation

How much variance when we use average difference of observations to represent the true average difference?

11

t-Test: Standard Deviation Calculation (2/2)

Standard deviation of difference in empirical averages

with degrees of freedom

Also known as Welsh’s t

Sample variance of X(0)

Number of observationsin X(0)

12

13

What is the p-value?

Can we ever accept hypothesis H1 ?

t-Statistics p-valueH0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

14

Effect Size

t-Test tests only if the difference is zero or not.

What about effect size?

Cohen’s d

where s is the pooled variance

t-Test: Effect Size

15

16

Bayesian Approach

17

Probability of hypothesis given data

The Bayes factor

Bayesian Approach

18

Example of Nonparametric Test

19

Two-sample Kolmogorov-Smirnov Test◦ Do X(0) and X(1) come from same underlying distribution?

◦ Hypothesis (same distribution)rejected at level p if

Nonparametric Testing of Distributions

Wikipedia

Em

pir

ical

Confidence interval factor

Sample size correction

The K-S test is less sensitive when the differences between curves is greatest at the beginning or the end of the distributions. Works best when distributions differ at center.

Good reading: M. Tygert, Statistical tests for whether a given set of independent, identically distributed draws comes from a specified probability density. PNAS 2010

20

Are Two User Features Independent?

Twitter users can have gender and number of tweets.

We want to determine whether gender is related to number of tweets.

Use chi-square test for independence

Chi-Squared Test

21

When to use chi-square test for independence:◦ Uniform sampling design◦ Categorical features◦ Population is significantly larger than sample

State the hypotheses:◦ H0 ?

◦ H1 ?

When to use Chi-Squared test

22

men = c(300, 100, 40)

women = c(350, 200, 90)

data = as.data.frame(rbind(men, women))

names(data) = c('low', 'med', 'large')

data

chisq.test(data)

Reject H0 (p<0.05) means …

Example Chi-Squared Test

23

24

Deciding Headlines

25

Select 50% users to see headline A◦ Titanic Sinks

Select 50% users to see headline B◦ Ship Sinks Killing Thousands

Assign half the readers to headline A and half to headline B?◦ Yes?◦ No?◦ Which test to use?

What happens A is MUCH better than B?

Revisiting The New York Times Dilemma

26

How to stop experiment early if hypothesis seems true◦ Stopping criteria often needs to be decided before

experiment starts◦ If ever needed:

Sequential Analysis (Sequential Hypothesis Test)

27

But there is a better way…

28

K distinct hypotheses (so far we had K = 2)◦ Hypothesis = choosing NYT headline

Each time we pull arm i we get reward Xi

(simple version of problem)

Underlying population (reward distribution) does not change over time

Bandit algorithms attempt to minimize regret◦ If n = total actions ; ni = total actions i

Bandit Algorithms

largest true average

29

Note that regret is defined over the true average reward

How can we estimate true average reward Xi?◦ We need to get lots of observations from population i◦ But what happens if E[Xi] is small?

Core of decision making problems: ◦ Exploration vs. exploitation◦ When exploring we seek to improve estimated average reward ◦ When exploiting we try what has worked better in the past

Balancing exploration and exploitation: ◦ Instead of trying the action with highest estimated average, we

try the action with the highest upper bound on its confidence interval(more on this next class)

Challenge

30

Multi-Armed Bandit (MAB)◦ Bandit process is a special type of Markov Decision

Process◦ Generally, reward Xi(ni) at ni–th arm pull of arm i is P[Xi(ni) |

Xi(ni-1)]

UCB1◦ Use arm i that maximizes

UCB1 (Upper Confidence Bound 1)

31

R ExamplenumT <- 2500 # number of time stepsttest <- c()# mean of population 1mean1 = 0.4# mean of population 2mean2 = 0.7# initialize observationsx1 <- c(rbinom(n=1,size=1,prob=mean1))x2 <- c(rbinom(n=1,size=1,prob=mean2))n1 = 1n2 = 1for (i in 2:numT){ # compute reward of bandit 1 reward_1 = mean(x1) + sqrt(2*log(i)/n1) # compute reward of bandit 2 reward_2 = mean(x2) + sqrt(2*log(i)/n2) # decides which arm to pull if (reward_1 > reward_2) { x1 <- c(rbinom(n=1,size=1,prob=mean1),x1) n1 = n1 + 1 } else { x2 <- c(rbinom(n=1,size=1,prob=mean2),x2) n2 = n2 + 1 } # computes the t-Test p-value of observations if ((n1 > 2) && (n2 > 2)) { d <- t.test(x1,x2)$p.value } else { d = 1 } ttest <- c(ttest, d)}par(mfrow=c(1,2))plot(2:numT,ttest,"l",xlab="Time",ylab="t-Test pvalue",log="y")barplot(c(n1,n2),xlab="Arm Pulls",ylab="Observations",names.arg=c("n0", "n1"))

Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see...

Documents

Transcript of Hypothesis Testing. Select 50% users to see headline A ◦ Titanic Sinks Select 50% users to see...