z -squared: the origin and use of χ²

- or - what I wish I had been told about statistics (but had to work out for myself)
Sean Wallis, Survey of English Usage, University College London, [email protected]

description

z -squared: the origin and use of χ². - or - what I wish I had been told about statistics (but had to work out for myself). Sean Wallis Survey of English Usage University College London [email protected]. Outline. What is the point of statistics? Linguistic alternation experiments - PowerPoint PPT Presentation

Transcript of z -squared: the origin and use of χ²

Page 1: z -squared: the origin and use of χ²

z-squared: the origin and use of χ²
- or -
what I wish I had been told about statistics (but had to work out for myself)

Sean Wallis
Survey of English Usage, University College London
[email protected]

Page 2: z -squared: the origin and use of χ²

Outline
• What is the point of statistics?
  – Linguistic alternation experiments
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• Comparing experiments and ‘effect size’
  – Swing and ‘skew’
• Low frequency events and small samples

Page 4: z -squared: the origin and use of χ²

What is the point of statistics?
• Analyse data you already have (observational science)
  – corpus linguistics
• Design new experiments (experimental science)
  – collect new data, add annotation
  – experimental linguistics in the lab
• Try new methods (philosophy of science)
  – pose the right question
• We are going to focus on z and χ² tests (a little maths)

Page 5: z -squared: the origin and use of χ²

What is ‘inferential statistics’?
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
• Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment
  – Infer from the sample to the population
• Let us consider one type of experiment
  – Linguistic alternation experiments

Page 7: z -squared: the origin and use of χ²

Alternation experiments
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question:
  – within these constraints, what factors influence the particular choice?

Page 8: z -squared: the origin and use of χ²

Alternation experiments
• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc.
    • lexical/grammatical: subdivide data by elements in surrounding context

Page 9: z -squared: the origin and use of χ²

Statistical assumptions
• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same text and speakers, etc.
    • may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Repeated sampling tends to form a Binomial distribution
  – This requires slightly more explanation...

Page 10: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
  – We toss a coin 10 times, and get 5 heads:
[Figure: frequency F against x (axis 1 to 9), N = 1]

Page 11: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 4]

Page 12: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 8]

Page 13: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 12]

Page 14: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 16]

Page 15: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 20]

Page 16: z -squared: the origin and use of χ²

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
[Figure: frequency F against x (axis 1 to 9), N = 24]

Page 17: z -squared: the origin and use of χ²

Binomial → Normal
• The Binomial (discrete) distribution tends to match the Normal (continuous) distribution
[Figure: Binomial histogram with a Normal curve overlaid, frequency F against x (axis 1 to 9)]
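The match between the two distributions can be checked numerically. The following is a minimal sketch for the coin example (10 tosses, P = 0.5); the function names are illustrative, and the Normal curve is assumed to have mean nP and standard deviation √(nP(1 – P)) on the count scale.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(x, n, P):
    """Exact Binomial probability of x successes in n trials."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)

def normal_pdf(x, mean, sd):
    """Normal density at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

n, P = 10, 0.5
mean = n * P                  # expected number of heads
sd = sqrt(n * P * (1 - P))    # standard deviation on the count scale

# Print the discrete and continuous values side by side for each x.
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, P), 4), round(normal_pdf(x, mean, sd), 4))
```

With only 10 tosses the two columns are already close near the mean; the approximation is weakest in the tails.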

Page 18: z -squared: the origin and use of χ²

The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean $\bar{x} = P$
  – standard deviation $s = \sqrt{P(1 - P)/n}$
  – With more data in the experiment, s will be smaller
  – Divide by 10 for probability scale
[Figure: Normal curve, frequency F against p (axis 0.1 to 0.7), with z.s marked either side of the mean]

Page 19: z -squared: the origin and use of χ²

The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean $\bar{x} = P$
  – standard deviation $s = \sqrt{P(1 - P)/n}$
  – 95% of the curve is within ~2 standard deviations of the mean (the correct figure is 1.95996!)
[Figure: Normal curve, frequency F against p (axis 0.1 to 0.7); the central 95% lies within z.s of the mean, with 2.5% in each tail]
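As a rough illustration of where 1.95996 comes from, the sketch below derives the two-tailed critical value of z from the standard Normal distribution and computes s for the coin example. The helper names `critical_z` and `binomial_sd` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(alpha=0.05):
    """Two-tailed critical value: alpha/2 of the curve lies in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def binomial_sd(P, n):
    """Standard deviation s = sqrt(P(1 - P) / n) on the probability scale."""
    return sqrt(P * (1 - P) / n)

z = critical_z()             # 95% interval, 2.5% in each tail
s = binomial_sd(0.5, 10)     # fair coin, 10 tosses
print(round(z, 5))           # 1.95996
print(0.5 - z * s, 0.5 + z * s)
```

Increasing n in `binomial_sd` shrinks s, which is the sense in which more data gives a narrower interval.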

Page 20: z -squared: the origin and use of χ²

The single-sample z test...
• Is an observation > z standard deviations from the expected population mean?
  – If yes, the result is significant
[Figure: Normal curve about P, frequency F against p (axis 0.1 to 0.7), with z.s marked either side, the two tail areas, and the observation p]

Page 21: z -squared: the origin and use of χ²

...gives us a “confidence interval”
• P ± z.s is the confidence interval for P
  – Enough for a test
[Figure: Normal curve about P, frequency F against p (axis 0.1 to 0.7), with z.s marked either side and the two tail areas]

Page 22: z -squared: the origin and use of χ²

...gives us a “confidence interval”
• P ± z.s is the confidence interval for P
  – But we need the interval about p
[Figure: Normal curve about P, with the interval (w–, w+) drawn about the observation p]

Page 23: z -squared: the origin and use of χ²

...gives us a “confidence interval”
• The interval about p is called the Wilson score interval (Wilson, 1927)
• This interval is asymmetric
• It reflects the Normal interval about P:
• If P is at the upper limit of p, p is at the lower limit of P
[Figure: Normal curve about P, with the interval (w–, w+) drawn about the observation p]

Page 24: z -squared: the origin and use of χ²

...gives us a “confidence interval”
• The interval about p is called the Wilson score interval (Wilson, 1927)
• To calculate w– and w+ we use this formula:

$$w^-, w^+ = \frac{p + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{p(1 - p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

[Figure: Normal curve about P, with the interval (w–, w+) drawn about the observation p]
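The Wilson formula is straightforward to compute directly. This is a minimal sketch rather than a reference implementation; the function name `wilson_interval` and the default z = 1.959964 (a 95% interval) are choices made for the example.

```python
from math import sqrt

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

# Observed proportion p(b | a) = 20/30 from the worked example in this talk.
w_minus, w_plus = wilson_interval(20 / 30, 30)
print(w_minus, w_plus)
```

For p above 0.5 the lower bound sits further from p than the upper bound does, which is the asymmetry described on the previous slide.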

Page 25: z -squared: the origin and use of χ²

Plotting confidence intervals
• E.g. plot the probability of adding successive attributive adjectives to a NP in ICE-GB
  – You can easily see that the first two falls are significant, but the last is not
[Figure: probability p (axis 0.00 to 0.25) plotted with confidence intervals at x = 0 to 4]

Page 27: z -squared: the origin and use of χ²

A simple experiment
• Consider two binary variables, A and B
  – Each one is subdivided:
    • A = {a, ¬a} e.g. NP has AJP? {yes, no}
    • B = {b, ¬b} e.g. Speaker gender {male, female}
  – Does B ‘affect’ A?
• We perform an experiment (or sample a corpus)
  – We find 45 cases (NPs) classified by A and B (below)
  – This is a ‘contingency table’
• Q1. Does B cause a to differ from A?
  – Does speaker gender affect decision to include an AJP?

A = dependent variable, B = independent variable

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45

Page 29: z -squared: the origin and use of χ²

Does B cause a to differ from A?
• Compare column 1 (a) and column 3 (A)
  – Probability of picking b at random (gender = male)
    • p(b) = 25/45 = 5/9 = 0.556
• Next, examine a (has AJP)
  – New probability of picking b
    • p(b | a) = 20/30 = 2/3 = 0.667
  – Confidence interval for p(b | a)
    • population standard deviation $s = \sqrt{p(b)(1 - p(b))/n} = \sqrt{(5/9 \times 4/9)/30}$
    • p ± z.s = (0.489, 0.845)

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45

Page 30: z -squared: the origin and use of χ²

Does B cause a to differ from A?
• Compare column 1 (a) and column 3 (A)
  – Probability of picking b at random (gender = male)
    • p(b) = 25/45 = 5/9 = 0.556
• Next, examine a (has AJP)
  – New probability of picking b
    • p(b | a) = 20/30 = 2/3 = 0.667
  – Confidence interval for p(b)
    • population standard deviation $s = \sqrt{p(b)(1 - p(b))/n} = \sqrt{(5/9 \times 4/9)/30}$
    • p ± z.s = (0.378, 0.733)
• Not significant: p(b | a) is inside c.i. for p(b)

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45
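The arithmetic of this single-sample test can be reproduced in a few lines. The sketch below uses the counts from the contingency table; the function name `single_sample_z_interval` is an illustrative choice.

```python
from math import sqrt

def single_sample_z_interval(P, n, z=1.959964):
    """Normal interval P ± z.s about a given (expected) probability P."""
    s = sqrt(P * (1 - P) / n)
    return P - z * s, P + z * s

P = 25 / 45        # p(b): expected probability of b, treated as given
p_obs = 20 / 30    # p(b | a): observed probability of b among the a cases
lower, upper = single_sample_z_interval(P, n=30)

# The result is significant only if the observation falls outside the interval.
significant = not (lower <= p_obs <= upper)
print(round(lower, 3), round(upper, 3), significant)  # 0.378 0.733 False
```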

Page 31: z -squared: the origin and use of χ²

Visualising this test
• Confidence interval for p(b)
  – P = expected value, E = expected distribution
[Figure: expected distribution E about P = p(b) = 0.556, with interval limits 0.378 and 0.733 at z.s either side; the observation p = p(b | a) = 0.667 for a lies inside the interval]

Page 32: z -squared: the origin and use of χ²

The single-sample z test
• Compares an observation with a given value
  – We used it to compare p(b | a) with p(b)
  – This is a “goodness of fit” test
  – Identical to a standard 2 × 1 χ² test
  – No need to test p(¬b | a) with p(¬b)
• Note that p(b) is given
  – All of the variation is assumed to be in the estimation of p(b | a)
  – Could also compare p(b | ¬a) (no AJP) with p(b)
• Q2. Does B cause a to differ from ¬a?
  – Does speaker gender affect presence / absence of AJP?
[Figure: observation p for a compared with expected distribution E for A]
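The claim that the single-sample z test is identical to a 2 × 1 goodness of fit χ² test can be checked on the example data: squaring z should reproduce χ². A short sketch, with illustrative names:

```python
from math import sqrt

def goodness_of_fit_chisq(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [20, 10]                  # column a: counts for b and ¬b
n = sum(observed)
P = 25 / 45                          # p(b), taken from the totals column A
expected = [n * P, n * (1 - P)]      # column a scaled to the distribution of A

chisq = goodness_of_fit_chisq(observed, expected)
z = (observed[0] / n - P) / sqrt(P * (1 - P) / n)
print(round(chisq, 4), round(z * z, 4))  # 1.5 1.5
```

Both values fall below 1.95996² ≈ 3.84, the 1 degree of freedom critical value at the 0.05 level, which agrees with the 'not significant' result above.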

Page 33: z -squared: the origin and use of χ²

z test for 2 independent proportions
• Method: combine observed values
  – take the difference (subtract) |p1 – p2|
  – calculate an ‘averaged’ confidence interval
  – p1 = p(b | a), p2 = p(b | ¬a)
[Figure: observed distributions O1 and O2 about p1 (for a) and p2 (for ¬a), frequency F against p]

Page 34: z -squared: the origin and use of χ²

z test for 2 independent proportions
• New confidence interval D = |O1 – O2|
  – standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$
  – $\hat{p}$ = p(b) = 25/45 = 5/9
  – compare z.s' with x = |p1 – p2|
[Figure: distribution D of the difference in p, with mean $\bar{x}$ = 0, the interval z.s', and the observed difference x = |p1 – p2|]

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45
         n1    n2

Page 36: z -squared: the origin and use of χ²

Does B cause a to differ from ¬a?
• Compare column 1 (a) and column 2 (¬a)
  – Probabilities (speaker gender = male)
    • p(b | a) = 20/30 = 2/3 = 0.667
    • p(b | ¬a) = 5/15 = 1/3 = 0.333
  – Confidence interval
    • pooled probability estimate $\hat{p}$ = p(b) = 5/9 = 0.556
    • standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)} = \sqrt{(5/9 \times 4/9)(1/30 + 1/15)}$
    • z.s' = 0.308
• Significant: |p(b | a) – p(b | ¬a)| > z.s'

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45
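The two-sample calculation can be sketched in code, together with a check that z² equals the 2 × 2 χ² statistic for the same table. The function names are illustrative; the pooled estimate follows the formula on the slide.

```python
from math import sqrt

def two_proportion_z(f1, n1, f2, n2):
    """z score and pooled standard deviation s' for two independent proportions."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    s = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / s, s

def chisq_2x2(table):
    """Pearson chi-square for a 2 x 2 table of counts (rows b, ¬b; columns a, ¬a)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chisq = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / total   # expected cell count
            chisq += (table[i][j] - e) ** 2 / e
    return chisq

z, s = two_proportion_z(20, 30, 5, 15)   # b out of a, b out of ¬a
print(round(1.959964 * s, 3))            # 0.308
print(round(z * z, 4), round(chisq_2x2([[20, 5], [10, 10]]), 4))  # 4.5 4.5
```

Since 4.5 exceeds 3.84, both routes agree with the 'significant' result on the slide.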

Page 39: z -squared: the origin and use of χ²

z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test
  – So you can use the usual method!
• BUT: these tests have different purposes
  – 2 × 1 goodness of fit compares single value a with superset A
    • assumes only a varies
  – 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?
[Figure: the goodness of fit (g.o.f.) χ² relates a to A; the 2 × 2 χ² relates a to ¬a]

Page 40: z -squared: the origin and use of χ²

Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
  • r × 1 goodness of fit χ² tests (r ≥ 2)
  • r × c χ² tests for homogeneity (r, c ≥ 2)
• z tests have 1 degree of freedom
  • strength: significance is due to only one source
  • strength: easy to plot values and confidence intervals
  • weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  • Examine χ² contributions for each row or column
  • Focus on alternation - try to test for a speaker choice
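One way to examine χ² contributions is to compute (O – E)²/E for every cell and then sum by row or column. The sketch below is written for any r × c table of counts but is shown on the 2 × 2 example from this talk; `chisq_contributions` is an illustrative name.

```python
def chisq_contributions(table):
    """Per-cell chi-square contributions (O - E)^2 / E for an r x c table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [
        [(o - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
         for j, o in enumerate(row)]
        for i, row in enumerate(table)
    ]

cells = chisq_contributions([[20, 5], [10, 10]])
row_totals = [sum(r) for r in cells]           # contribution of each row
col_totals = [sum(c) for c in zip(*cells)]     # contribution of each column
print(cells, row_totals, col_totals)
```

Rows or columns with the largest totals are the main sources of variation and are natural candidates for a focused 1 degree of freedom test.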

Page 43: z -squared: the origin and use of χ²

How big is the effect?
• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
    • With lots of data, a tiny change would be significant
  – Don’t use χ², p or z values to compare two different experiments
    • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05
• There are a number of ways of measuring ‘association strength’ or ‘effect size’

Page 46: z -squared: the origin and use of χ²

Percentage swing
• Compare probabilities of a DV value (a, AJP) across a change in the IV (gender):
  – swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = -0.3
• As a proportion of the initial value
  – % swing d% = d/p(a | b) = -0.3/0.8 = -37.5%
• We can even calculate confidence intervals on d or d%
  – Use z test for two independent proportions (we are comparing differences in p values)

          a    ¬a   total
  b      20     5     25
  ¬b     10    10     20
  total  30    15     45
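A minimal sketch of the swing calculation, using the table layout from the slides (rows b and ¬b, columns a and ¬a). The function name `swing` is an illustrative choice.

```python
def swing(table):
    """Return (d, d%) where d = p(a | ¬b) - p(a | b) and d% = d / p(a | b)."""
    (a_b, nota_b), (a_notb, nota_notb) = table
    p_a_given_b = a_b / (a_b + nota_b)
    p_a_given_notb = a_notb / (a_notb + nota_notb)
    d = p_a_given_notb - p_a_given_b
    return d, d / p_a_given_b

d, d_pct = swing([[20, 5], [10, 10]])
print(round(d, 3), f"{d_pct:.1%}")  # -0.3 -37.5%
```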

Page 48: z -squared: the origin and use of χ²

Cramér’s φ
• Can be used on any χ² table
  – Mathematically well defined
  – Probabilistic (c.f. swing d ∈ [-1, +1], d% = ?)
    • φ = 0 ⇒ no relationship between A and B
    • φ = 1 ⇒ B strictly determines A
    • straight line between these two extremes
[Annotation: ‘averaged’ swing]

  φ = 0:                          φ = 1:
          a    ¬a   total                 a    ¬a   total
  b      0.5   0.5    1           b       1     0     1
  ¬b     0.5   0.5    1           ¬b      0     1     1
  total   1     1     2           total   1     1     2

Page 50: z -squared: the origin and use of χ²

Cramér’s φ
• Can be used on any χ² table
  – Mathematically well defined
  – Probabilistic (c.f. swing d ∈ [-1, +1], d% = ?)
    • φ = 0 ⇒ no relationship between A and B
    • φ = 1 ⇒ B strictly determines A
    • straight line between these two extremes
  – Based on χ²
    • $\phi = \sqrt{\chi^2/N}$ (2 × 2), N = grand total
    • $\phi_c = \sqrt{\chi^2/((k - 1)N)}$ (r × c), k = min(r, c)
• Can be used for r × 1 goodness of fit tests
  – Recalibrate using methods in Wallis (2012)
  – Better indicator than percentage swing
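The φ formulae can be sketched as a single function that covers both the 2 × 2 and the r × c case, since k – 1 = 1 when the table is 2 × 2. The name `cramers_phi` is illustrative, and the goodness of fit recalibration from Wallis (2012) is not attempted here.

```python
from math import sqrt

def cramers_phi(table):
    """Cramér's phi: sqrt(chi-square / ((k - 1) N)), with k = min(r, c)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chisq = sum(
        (o - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
        for i, row in enumerate(table)
        for j, o in enumerate(row)
    )
    k = min(len(rows), len(cols))
    return sqrt(chisq / ((k - 1) * total))

print(round(cramers_phi([[20, 5], [10, 10]]), 4))   # 0.3162
print(cramers_phi([[0.5, 0.5], [0.5, 0.5]]))        # 0.0 (no relationship)
print(cramers_phi([[1, 0], [0, 1]]))                # 1.0 (B strictly determines A)
```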

Page 52: z -squared: the origin and use of χ²

Significantly better?
• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test swings
• Use z test for two samples from different populations
• Use $s' = \sqrt{s_1^2 + s_2^2}$
• Test |d1(a) – d2(a)| > z.s'

  Experiment 1:                   Experiment 2:
          a    ¬a   total                 a    ¬a   total
  b      20     5     25          b      50     5     55
  ¬b     10    10     20          ¬b     10    10     20
  total  30    15     45          total  60    15     75

[Figure: swings d1(a) and d2(a) with confidence intervals, plotted on an axis from -0.7 to 0]
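The comparison of two swings can be sketched as follows. The slide does not say how s1 and s2 are obtained, so this example makes an assumption: each is taken to be the unpooled Normal-approximation standard deviation of the difference between the two proportions in that table. Function names are illustrative.

```python
from math import sqrt

def swing_and_variance(table):
    """Swing d = p(a | ¬b) - p(a | b) and an assumed variance s^2 for d."""
    (a_b, nota_b), (a_notb, nota_notb) = table
    n_b, n_notb = a_b + nota_b, a_notb + nota_notb
    p_b, p_notb = a_b / n_b, a_notb / n_notb
    d = p_notb - p_b
    # Assumption: unpooled variance of a difference of two proportions.
    variance = p_b * (1 - p_b) / n_b + p_notb * (1 - p_notb) / n_notb
    return d, variance

def swings_differ(table1, table2, z=1.959964):
    """Test |d1 - d2| > z.s' where s' = sqrt(s1^2 + s2^2)."""
    d1, v1 = swing_and_variance(table1)
    d2, v2 = swing_and_variance(table2)
    s_prime = sqrt(v1 + v2)
    return abs(d1 - d2) > z * s_prime, d1, d2, z * s_prime

result, d1, d2, width = swings_differ([[20, 5], [10, 10]], [[50, 5], [10, 10]])
print(result, round(d1, 3), round(d2, 3), round(width, 3))
```

Under this assumption the difference between the two swings is smaller than z.s', so the second experiment would not count as significantly stronger than the first.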

Page 55: z -squared: the origin and use of χ²

Modern improvements on z and χ²
• ‘Continuity correction’ for small n
  – Yates’ χ² test
  – can be used elsewhere
• Wilson’s score interval
  – The correct formula for intervals on p
• Newcombe (1998) improves on 2 × 2 χ² test
  – Uses the Wilson interval
  – Better than χ² and log-likelihood (etc.) for low-frequency events
[Figure: Wilson interval (w–, w+) about an observation p]
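As a sketch of how the Wilson interval feeds into Newcombe's approach, the code below computes an interval for the difference p1 – p2 by combining the inner Wilson widths of the two proportions (the score-based method without continuity correction in Newcombe 1998). The function names are illustrative, and the example reuses p(b | a) and p(b | ¬a) from the earlier slides.

```python
from math import sqrt

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

def newcombe_wilson(p1, n1, p2, n2, z=1.959964):
    """Interval for the difference p1 - p2 built from two Wilson intervals."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

lower, upper = newcombe_wilson(20 / 30, 30, 5 / 15, 15)
significant = not (lower <= 0 <= upper)   # significant if the interval excludes 0
print(lower, upper, significant)
```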

Page 56: z -squared: the origin and use of χ²

Conclusions
• The basic idea of all of these tests is
  – Predict future results if experiment were repeated
    • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by Normal distribution – many uses
    • Plotting confidence intervals
• Use goodness of fit or single-sample z tests to compare a sample, a, with a point it is dependent on, A
• Use 2 × 2 tests or two independent sample z tests to compare two observed samples (a, ¬a)
• When using larger r × c tests, simplify as far as possible to identify the source of variation!

Page 57: z -squared: the origin and use of χ²

Conclusions
• Two methods for measuring the ‘size’ of an experimental effect
  – Simple idea, easy to report
    • absolute or percentage swing
  – More reliable, but possibly less intuitive
    • Cramér’s φ
  – You can compare two experiments
    • Is absolute swing significantly greater?
    • Use a type of z test!
    • A similar approach is possible with φ
• Take care with small samples / low frequencies
  – Use Wilson and Newcombe’s methods instead!

Page 58: z -squared: the origin and use of χ²

References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2009. Binomial distributions, probability and Wilson’s confidence interval. London: Survey of English Usage.
• Wallis, S.A. 2010. z-squared: The origin and use of χ². London: Survey of English Usage.
• Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• Assorted statistical tests:
  – www.ucl.ac.uk/english-usage/staff/sean/resources/2x2chisq.xls