z -squared: the origin and use of χ²

- or -

what I wish I had been told about statistics (but had to work out for myself)

Sean Wallis

Survey of English Usage, University College London

[email protected]

Outline

• What is the point of statistics?
  – Linguistic alternation experiments
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• Comparing experiments and ‘effect size’
  – Swing and ‘skew’
• Low frequency events and small samples

What is the point of statistics?

• Analyse data you already have (observational science)
  – corpus linguistics
• Design new experiments (experimental science)
  – collect new data, add annotation
  – experimental linguistics in the lab
• Try new methods (philosophy of science)
  – pose the right question
• We are going to focus on z and χ² tests (a little maths)

What is ‘inferential statistics’?

• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
• Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment
  – Infer from the sample to the population
• Let us consider one type of experiment
  – Linguistic alternation experiments

Alternation experiments

• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question:
  – within these constraints, what factors influence the particular choice?

Alternation experiments

• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc.
    • lexical/grammatical: subdivide data by elements in surrounding context

Statistical assumptions

• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same text and speakers, etc.
    • may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Repeated sampling tends to form a Binomial distribution
  – This requires slightly more explanation...

The Binomial distribution

• Repeated sampling tends to form a Binomial distribution
  – We toss a coin 10 times, and get 5 heads

[Figure: histograms of frequency F against number of heads x, built up over N = 1, 4, 8, 12, 16, 20 and 24 repetitions of the experiment.]
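The shape these histograms tend towards can be computed directly. The following is a minimal sketch, assuming a fair coin (P = 0.5) and n = 10 tosses as in the example above; the function name `binomial_pmf` is illustrative and not from the original slides.

```python
from math import comb

def binomial_pmf(x: int, n: int, P: float) -> float:
    """Probability of observing exactly x successes in n independent trials."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)

n, P = 10, 0.5  # ten tosses of a fair coin (assumed values)
distribution = [binomial_pmf(x, n, P) for x in range(n + 1)]

# Print a rough text histogram of the expected distribution of heads.
for x, prob in enumerate(distribution):
    print(f"x = {x:2d}  p = {prob:.4f}  {'#' * round(prob * 100)}")
```

The most likely outcome is 5 heads, but it accounts for only about a quarter of repetitions, which is why a single experiment cannot simply be taken at face value.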

Binomial → Normal

• The Binomial (discrete) distribution tends to match the Normal (continuous) distribution

[Figure: Binomial histogram of F against x with the Normal curve overlaid.]

The central limit theorem

• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean $\bar{x} = P$
  – standard deviation $s = \sqrt{P(1 - P)/n}$
  – With more data in the experiment, s will be smaller
  – Divide by 10 for probability scale

[Figure: Normal curve of F against p, centred on P, with z.s marked either side of the mean.]

• 95% of the curve is within ~2 standard deviations of the mean (the correct figure is 1.95996!)

[Figure: the same Normal curve with the central 95% marked and a 2.5% tail on either side.]
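As a rough illustration of these two parameters, the sketch below computes s and the 95% Normal interval about P for the coin example (P = 0.5, n = 10, both assumed here). The helper name `normal_interval` is hypothetical; the critical value z comes from Python's standard `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def normal_interval(P: float, n: int, level: float = 0.95):
    """Return (z, s, (lower, upper)) for the Normal approximation about P."""
    # Two-tailed critical value: for level = 0.95 this is about 1.95996.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Standard deviation of the observed proportion on the probability scale.
    s = sqrt(P * (1 - P) / n)
    return z, s, (P - z * s, P + z * s)

z, s, (lower, upper) = normal_interval(0.5, 10)
print(f"z = {z:.5f}, s = {s:.4f}, interval = ({lower:.3f}, {upper:.3f})")

# With more data in the experiment, s is smaller.
_, s_large, _ = normal_interval(0.5, 100)
print(f"s for n = 100: {s_large:.4f}")
```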

The single-sample z test...

• Is an observation > z standard deviations from the expected population mean?
  – If yes, the result is significant

[Figure: Normal curve about P with z.s marked either side and the tail areas shaded; the observation p is compared with this interval.]

...gives us a “confidence interval”

• P ± z.s is the confidence interval for P
  – Enough for a test
  – But we need the interval about p

[Figure: the Normal interval about P, and an interval (w–, w+) about the observation p.]

• The interval about p is called the Wilson score interval (Wilson, 1927)
  – This interval is asymmetric
  – It reflects the Normal interval about P: if P is at the upper limit of p, p is at the lower limit of P

[Figure: Wilson interval (w–, w+) about the observation p, shown against the Normal interval about P.]

• To calculate w– and w+ we use this formula (Wilson, 1927):

$$ (w^-, w^+) = \frac{p + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{p(1 - p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}} $$
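The Wilson score interval is straightforward to compute. This is a minimal sketch; the function name `wilson_interval` and the example values (p = 0.1, n = 20) are assumptions chosen for illustration, not data from the slides.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

def wilson_interval(p: float, n: int, z: float = Z_95):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

# Illustrative low-frequency observation: 2 cases out of 20.
p, n = 0.1, 20
w_minus, w_plus = wilson_interval(p, n)
print(f"p = {p}, interval = ({w_minus:.3f}, {w_plus:.3f})")

# The interval is asymmetric: it is wider on the side away from the boundary.
print(f"below p: {p - w_minus:.3f}, above p: {w_plus - p:.3f}")
```

Unlike P ± z.s applied naively to p, the lower bound cannot fall below 0 and the upper bound cannot exceed 1.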

Plotting confidence intervals

• E.g. plot the probability of adding successive attributive adjectives to an NP in ICE-GB
  – You can easily see that the first two falls are significant, but the last is not

[Figure: probability p (axis 0.00 to 0.25) with confidence intervals, plotted at x = 0 to 4.]

A simple experiment

• Consider two binary variables, A and B
  – Each one is subdivided:
    • A = {a, ¬a}, e.g. NP has AJP? {yes, no}
    • B = {b, ¬b}, e.g. speaker gender {male, female}
  – Does B ‘affect’ A?
    • A = dependent variable
    • B = independent variable
• We perform an experiment (or sample a corpus)
  – We find 45 cases (NPs) classified by A and B
  – This is a ‘contingency table’:

            a    ¬a   Total
    b      20     5     25
    ¬b     10    10     20
    Total  30    15     45

• Q1. Does B cause a to differ from A?
  – Does speaker gender affect decision to include an AJP?

Does B cause a to differ from A?

• Compare column 1 (a) and column 3 (A)
  – Probability of picking b at random (gender = male)
    • p(b) = 25/45 = 5/9 = 0.556
• Next, examine a (has AJP)
  – New probability of picking b
    • p(b | a) = 20/30 = 2/3 = 0.667
  – Confidence interval for p(b)
    • population standard deviation $s = \sqrt{p(b)(1 - p(b))/n} = \sqrt{(5/9 \times 4/9)/30}$
    • p(b) ± z.s = (0.378, 0.733)
• Not significant: p(b | a) is inside the c.i. for p(b)
  – Equivalently, p(b) is inside p(b | a) ± z.s = (0.489, 0.845)
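The calculation above can be sketched in a few lines. The function name `single_sample_z` is illustrative; the data are the cell counts from the contingency table (20 of the 30 cases of a are b, against an expected p(b) = 25/45).

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

def single_sample_z(p_obs: float, P: float, n: int, z: float = Z_95):
    """Compare an observed proportion p_obs (from n cases) with a given value P."""
    s = sqrt(P * (1 - P) / n)          # population standard deviation
    interval = (P - z * s, P + z * s)  # Normal interval about P
    z_score = (p_obs - P) / s
    return z_score, interval, abs(z_score) > z

P = 25 / 45      # p(b), treated as given
p_obs = 20 / 30  # p(b | a)
z_score, (lower, upper), significant = single_sample_z(p_obs, P, n=30)

print(f"interval for p(b) = ({lower:.3f}, {upper:.3f})")
print(f"z score = {z_score:.3f}, significant = {significant}")
```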

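Visualising this test

• Confidence interval for p(b)
  – P = expected value
  – E = expected distribution

[Figure: expected distribution E centred on P = p(b) = 0.556, with limits 0.378 and 0.733 at z.s either side; the observation p(b | a) = 0.667 falls inside. A second panel plots p for A (p(b) with its interval) and for a (p(b | a)).]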

The single-sample z test

• Compares an observation with a given value
  – We used it to compare p(b | a) with p(b)
  – This is a “goodness of fit” test
  – Identical to a standard 2 × 1 χ² test
  – No need to test p(¬b | a) with p(¬b)
• Note that p(b) is given
  – All of the variation is assumed to be in the estimation of p(b | a)
  – Could also compare p(b | ¬a) (no AJP) with p(b)
• Q2. Does B cause a to differ from ¬a?
  – Does speaker gender affect presence / absence of AJP?
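The claim that the single-sample z test is identical to a 2 × 1 goodness of fit χ² can be checked numerically. In this sketch (function name `chi_square_gof` is illustrative), the observed column a = (20, 10) is compared with the distribution expected from A, and χ² comes out equal to z², here 1.5.

```python
from math import sqrt

def chi_square_gof(observed, expected):
    """Goodness of fit chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 30                         # total for column a
P = 25 / 45                    # p(b), the given population value
observed = [20, 10]            # b and ¬b within a
expected = [n * P, n * (1 - P)]

chi2 = chi_square_gof(observed, expected)
z = (observed[0] / n - P) / sqrt(P * (1 - P) / n)

print(f"chi-square = {chi2:.4f}, z squared = {z * z:.4f}")
```

Both fall short of the critical value at the 0.05 level (1.95996² ≈ 3.841 for χ² with one degree of freedom), so the two tests give the same verdict.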

z test for 2 independent proportions

• Method: combine observed values
  – take the difference (subtract) |p1 – p2|
  – calculate an ‘averaged’ confidence interval
  – p1 = p(b | a), p2 = p(b | ¬a)

[Figure: observed distributions O1 and O2 about p1 and p2, plotted for a and ¬a.]

• New confidence interval D = |O1 – O2|
  – standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$
  – $\hat{p}$ = p(b) = 25/45 = 5/9
  – n1 = 30 (column a), n2 = 15 (column ¬a)
  – compare z.s' with x = |p1 – p2|

[Figure: difference distribution D with mean $\bar{x} = 0$ and limit z.s', against which the observed difference x = |p1 – p2| is compared.]

Does B cause a to differ from ¬a?

• Compare column 1 (a) and column 2 (¬a)
  – Probabilities (speaker gender = male)
    • p(b | a) = 20/30 = 2/3 = 0.667
    • p(b | ¬a) = 5/15 = 1/3 = 0.333
  – Confidence interval
    • pooled probability estimate $\hat{p}$ = p(b) = 5/9 = 0.556
    • standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)} = \sqrt{(5/9 \times 4/9)(1/30 + 1/15)}$
    • z.s' = 0.308
• Significant: |p(b | a) – p(b | ¬a)| > z.s'
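The two-sample calculation can be sketched as follows. The function name `two_proportion_z` is illustrative; the inputs are the cell counts and column totals from the contingency table.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

def two_proportion_z(f1: int, n1: int, f2: int, n2: int, z: float = Z_95):
    """z test for two independent proportions, using a pooled estimate."""
    p1, p2 = f1 / n1, f2 / n2
    p_hat = (f1 + f2) / (n1 + n2)  # pooled probability estimate
    s = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    difference = abs(p1 - p2)
    return difference, z * s, difference > z * s

# b within a: 20 out of 30; b within ¬a: 5 out of 15.
difference, width, significant = two_proportion_z(20, 30, 5, 15)
print(f"|p1 - p2| = {difference:.3f}, z.s' = {width:.3f}, significant = {significant}")
```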

z test for 2 independent proportions

• Identical to a standard 2 × 2 χ² test
  – So you can use the usual method!
• BUT: these tests have different purposes
  – 2 × 1 goodness of fit compares single value a with superset A
    • assumes only a varies
  – 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?

[Figure: the goodness of fit (g.o.f.) χ² compares a with A; the 2 × 2 χ² compares a with ¬a.]
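The equivalence with the 2 × 2 χ² test can also be checked numerically. This sketch computes χ² "the usual method" (expected cell = row total × column total / N) and compares it with z² from the pooled two-proportion test; the function name `chi_square` is illustrative. For the example table both come out at 4.5.

```python
from math import sqrt

def chi_square(table):
    """Chi-square for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / N
            total += (observed - expected) ** 2 / expected
    return total

table = [[20, 5],    # b:  a, ¬a
         [10, 10]]   # ¬b: a, ¬a
chi2 = chi_square(table)

# z score for the difference between p(b | a) and p(b | ¬a).
p_hat = 25 / 45
s = sqrt(p_hat * (1 - p_hat) * (1 / 30 + 1 / 15))
z = (20 / 30 - 5 / 15) / s

print(f"chi-square = {chi2:.4f}, z squared = {z * z:.4f}")
```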

Larger χ² tests

• χ² is popular because it can be applied to contingency tables with many values
  – r × 1 goodness of fit χ² tests (r ≥ 2)
  – r × c χ² tests for homogeneity (r, c ≥ 2)
• z tests have 1 degree of freedom
  – strength: significance is due to only one source
  – strength: easy to plot values and confidence intervals
  – weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  – Examine χ² contributions for each row or column
  – Focus on alternation: try to test for a speaker choice
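One way to examine χ² contributions is to keep the per-cell terms and sum them by row. The sketch below does this for a 3 × 2 table; note that the table values and row labels are hypothetical, invented purely for illustration, and the function name `chi_square_contributions` is an assumption.

```python
def chi_square_contributions(table):
    """Return the (O - E)^2 / E contribution of every cell in an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    return [
        [(obs - row_totals[i] * col_totals[j] / N) ** 2
         / (row_totals[i] * col_totals[j] / N)
         for j, obs in enumerate(row)]
        for i, row in enumerate(table)
    ]

# Hypothetical data: three values of B (rows) against a, ¬a (columns).
labels = ["b1", "b2", "b3"]
table = [[20, 5], [10, 10], [12, 18]]

contributions = chi_square_contributions(table)
row_sums = [sum(row) for row in contributions]
chi2 = sum(row_sums)

for label, row_sum in zip(labels, row_sums):
    print(f"{label}: contribution {row_sum:.3f} ({100 * row_sum / chi2:.0f}% of chi-square)")
```

A row that accounts for most of the total is a candidate for a focused 2 × 2 or z test of its own.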

How big is the effect?

• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
    • With lots of data, a tiny change would be significant
  – Don’t use χ², p or z values to compare two different experiments
    • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05
• There are a number of ways of measuring ‘association strength’ or ‘effect size’

Percentage swing

• Compare probabilities of a DV value (a, AJP) across a change in the IV (gender):
  – swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = -0.3
• As a proportion of the initial value
  – % swing d% = d/p(a | b) = -0.3/0.8 = -37.5%
• We can even calculate confidence intervals on d or d%
  – Use z test for two independent proportions (we are comparing differences in p values)
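The swing figures can be reproduced as below. This is a sketch using the example table; the significance check assumes the pooled two-proportion z test described earlier, applied this time across rows (b and ¬b) rather than columns.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

# Row b: 20 of 25 cases are a; row ¬b: 10 of 20 cases are a.
f_b, n_b = 20, 25
f_nb, n_nb = 10, 20

p_a_given_b = f_b / n_b
p_a_given_nb = f_nb / n_nb

d = p_a_given_nb - p_a_given_b     # absolute swing
d_percent = 100 * d / p_a_given_b  # swing as a proportion of the initial value

# Pooled standard deviation for the difference between two proportions.
p_hat = (f_b + f_nb) / (n_b + n_nb)
s = sqrt(p_hat * (1 - p_hat) * (1 / n_b + 1 / n_nb))
significant = abs(d) > Z_95 * s

print(f"d = {d:.3f}, d% = {d_percent:.1f}%, z.s' = {Z_95 * s:.3f}, significant = {significant}")
```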

Cramér’s φ

• Can be used on any χ² table
  – Mathematically well defined
  – Probabilistic (cf. swing d ∈ [-1, +1], d% = ?)
    • φ = 0: no relationship between A and B
    • φ = 1: B strictly determines A
    • straight line between these two extremes (an ‘averaged’ swing)

    φ = 0:
            a    ¬a   Total
    b      0.5   0.5    1
    ¬b     0.5   0.5    1
    Total   1     1     2

    φ = 1:
            a    ¬a   Total
    b       1     0     1
    ¬b      0     1     1
    Total   1     1     2

  – Based on χ²
    • $\phi = \sqrt{\chi^2/N}$ (2 × 2), N = grand total
    • $\phi_c = \sqrt{\chi^2/((k - 1)N)}$ (r × c), k = min(r, c)
• Can be used for r × 1 goodness of fit tests
  – Recalibrate using methods in Wallis (2012)
  – Better indicator than percentage swing
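A minimal sketch of the calculation, assuming the standard homogeneity χ² with no zero expected cells; the function names `chi_square` and `cramers_phi` are illustrative. It reproduces the two extreme tables above and gives φ ≈ 0.316 for the example experiment.

```python
from math import sqrt

def chi_square(table):
    """Chi-square for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / N) ** 2
        / (row_totals[i] * col_totals[j] / N)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

def cramers_phi(table):
    """Cramér's phi: sqrt(chi-square / ((k - 1) N)), with k = min(r, c)."""
    N = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / ((k - 1) * N))

phi_flat = cramers_phi([[0.5, 0.5], [0.5, 0.5]])  # no relationship
phi_identity = cramers_phi([[1, 0], [0, 1]])      # B strictly determines A
phi_example = cramers_phi([[20, 5], [10, 10]])    # the example experiment

print(f"flat: {phi_flat:.3f}, identity: {phi_identity:.3f}, example: {phi_example:.3f}")
```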

Significantly better?

• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test swings
  – Use z test for two samples from different populations
  – Use $s' = \sqrt{s_1^2 + s_2^2}$
  – Test |d1(a) – d2(a)| > z.s'

    Experiment 1:
            a    ¬a   Total
    b      20     5     25
    ¬b     10    10     20
    Total  30    15     45

    Experiment 2:
            a    ¬a   Total
    b      50     5     55
    ¬b     10    10     20
    Total  60    15     75

[Figure: swings d1(a) and d2(a) plotted on an axis from -0.7 to 0.]
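The comparison of the two swings might be sketched as follows. The slide does not say how s1 and s2 are obtained; this sketch assumes each is the unpooled standard deviation of the swing, i.e. s² = p(a | b)(1 – p(a | b))/n_b + p(a | ¬b)(1 – p(a | ¬b))/n_¬b, which is only one possible choice. The function name `swing_and_variance` is hypothetical.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

def swing_and_variance(table):
    """Return the swing d = p(a | ¬b) - p(a | b) and its (unpooled) variance."""
    (a_b, not_a_b), (a_nb, not_a_nb) = table
    n_b, n_nb = a_b + not_a_b, a_nb + not_a_nb
    p_b, p_nb = a_b / n_b, a_nb / n_nb
    # Assumption: variance of a difference of two independent proportions.
    variance = p_b * (1 - p_b) / n_b + p_nb * (1 - p_nb) / n_nb
    return p_nb - p_b, variance

d1, var1 = swing_and_variance([[20, 5], [10, 10]])  # experiment 1
d2, var2 = swing_and_variance([[50, 5], [10, 10]])  # experiment 2

s_combined = sqrt(var1 + var2)  # s' = sqrt(s1^2 + s2^2)
difference = abs(d1 - d2)
significant = difference > Z_95 * s_combined

print(f"d1 = {d1:.3f}, d2 = {d2:.3f}")
print(f"|d1 - d2| = {difference:.3f}, z.s' = {Z_95 * s_combined:.3f}, significant = {significant}")
```

Under this assumption the two swings do not differ significantly, even though each experiment is significant on its own.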

Modern improvements on z and χ²

• ‘Continuity correction’ for small n
  – Yates’ χ² test (can be used elsewhere)
• Wilson’s score interval
  – The correct formula for intervals on p
• Newcombe (1998) improves on 2 × 2 χ² test
  – Uses the Wilson interval
  – Better than χ² and log-likelihood (etc.) for low-frequency events

[Figure: Wilson interval (w–, w+) about p, on an axis starting at 0.]
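A sketch of a Wilson-based interval for the difference between two independent proportions, following the square-and-add construction in Newcombe (1998) without continuity correction. The function names are illustrative, and the example reuses p(b | a) = 20/30 and p(b | ¬a) = 5/15 from the earlier experiment.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%

def wilson_interval(p: float, n: int, z: float = Z_95):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

def newcombe_wilson(p1: float, n1: int, p2: float, n2: int, z: float = Z_95):
    """Interval for d = p1 - p2 built from the two Wilson intervals."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

p1, n1 = 20 / 30, 30  # p(b | a)
p2, n2 = 5 / 15, 15   # p(b | ¬a)
lower, upper = newcombe_wilson(p1, n1, p2, n2)

# The difference is significant if the interval excludes zero.
significant = lower > 0 or upper < 0
print(f"d = {p1 - p2:.3f}, interval = ({lower:.3f}, {upper:.3f}), significant = {significant}")
```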

Conclusions

• The basic idea of all of these tests is
  – Predict future results if experiment were repeated
    • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by Normal distribution (many uses)
    • Plotting confidence intervals
• Use goodness of fit or single-sample z tests to compare a sample, a, with a point it is dependent on, A
• Use 2 × 2 tests or two independent sample z tests to compare two observed samples (a, ¬a)
• When using larger r × c tests, simplify as far as possible to identify the source of variation!

• Two methods for measuring the ‘size’ of an experimental effect
  – Simple idea, easy to report
    • absolute or percentage swing
  – More reliable, but possibly less intuitive
    • Cramér’s φ
  – You can compare two experiments
    • Is absolute swing significantly greater?
    • Use a type of z test!
    • A similar approach is possible with φ
• Take care with small samples / low frequencies
  – Use Wilson and Newcombe’s methods instead!

References

• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2009. Binomial distributions, probability and Wilson’s confidence interval. London: Survey of English Usage.
• Wallis, S.A. 2010. z-squared: The origin and use of χ². London: Survey of English Usage.
• Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• Assorted statistical tests:
  – www.ucl.ac.uk/english-usage/staff/sean/resources/2x2chisq.xls
