Standing Between a Bayesian and a Frequentist: An Empirical Bayes Exploration of Movies, Baseball, and Williams College

Arthur Berg
Pennsylvania State University

Introduction Bayes Estimation Empirical Bayes Books Books Summary

Bayesian and Frequentist Representatives

Rev. Thomas Bayes FRS (1702-1761)
English Mathematician, Presbyterian Minister

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Sir Ronald Fisher FRS (1890-1962)
English Statistician, Evolutionary Biologist, Geneticist

"Let the data speak for itself."

Arthur Berg Standing Between a Bayesian and a Frequentist 2 / 27


Bayes Estimator as a Convex Combination

1st Goal: List the top 250 movies of all time.

Movies are rated on a scale of 1 to 10.

Some movies are rated by many people, and some by only a few.

Movies with fewer than 3000 votes are not considered.

The average rating across all movies is C = 6.9.

⋆ $\mu_i$ represents the mean rating by everyone who has seen movie i.
⋆ The real goal is to construct the best estimate of $\mu_i$, then pick the top 250.

The frequentist approach uses only $\bar{X}_i$, the average rating for movie i.

$$\hat{\mu}_i^{(\text{Fisher})} = \bar{X}_i$$

The Bayesian approach shrinks $\bar{X}_i$ toward C, with more shrinkage applied when the number of votes for movie i is small.

$$\hat{\mu}_i^{(\text{Bayes})} = \alpha_i \bar{X}_i + (1 - \alpha_i)\,C \quad \text{where } \alpha_i \in (0, 1)$$


Internet Movie Database: Top 250

Rank  WR   R    Title                                     Votes
1     9.2  9.2  The Shawshank Redemption (1994)           546,155
2     9.1  9.2  The Godfather (1972)                      427,961
3     9.0  9.0  The Godfather: Part II (1974)             257,643
4     8.9  9.0  The Good, the Bad and the Ugly (1966)     170,045
5     8.9  9.0  Pulp Fiction (1994)                       436,456
6     8.9  8.9  Inception (2010)                          265,531
7     8.9  8.9  Schindler's List (1993)                   289,170
8     8.9  8.9  12 Angry Men (1957)                       126,983
9     8.8  8.9  One Flew Over the Cuckoo's Nest (1975)    225,419
10    8.8  8.9  The Dark Knight (2008)                    487,800
...
85    8.5  8.7  Black Swan (2010)                         20,326
...
142   8.2  8.3  Avatar (2009)                             285,005
...
240   8.0  8.5  True Grit (2010)                          6,444


IMDb Weighted Ranking—“a true Bayesian estimate”

$$WR_i = \frac{v_i R_i + m C}{v_i + m} = \underbrace{\frac{v_i}{v_i + m}}_{\alpha_i}\,\underbrace{R_i}_{\bar{X}_i} + \underbrace{\frac{m}{v_i + m}}_{1 - \alpha_i}\,C$$

▸ $R_i$ = average rating of movie i ($\bar{X}_i$)

▸ vi = total number of votes from regular voters

▸ m = minimum # of votes to make the list = 3000

▸ C = grand mean across all movies in the database = 6.9
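The weighted rating can be checked by hand against the Top 250 table. The following is a minimal sketch, assuming the slide's values m = 3000 and C = 6.9; the function name `weighted_rating` is made up for illustration, and because the table only shows R rounded to one decimal, the results only approximately match the published WR column.

```python
def weighted_rating(votes: int, rating: float, m: int = 3000, c: float = 6.9) -> float:
    """IMDb-style weighted rating WR = (v*R + m*C) / (v + m)."""
    alpha = votes / (votes + m)  # weight on the movie's own average rating
    return alpha * rating + (1 - alpha) * c


# (title, votes, rounded average rating R) copied from the Top 250 table.
examples = [
    ("The Shawshank Redemption (1994)", 546_155, 9.2),
    ("Black Swan (2010)", 20_326, 8.7),
    ("True Grit (2010)", 6_444, 8.5),
]

for title, votes, rating in examples:
    print(f"{title}: WR = {weighted_rating(votes, rating):.2f}")
```

With only 6,444 votes, True Grit is pulled from 8.5 down to about 8.0, while a movie with hundreds of thousands of votes barely moves.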


A Bayesian Calculation

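$\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,v_i})$ represents the $v_i$ ratings of movie i.

prior: $\mu_i \sim N(\mu_0, \sigma_0^2)$

conditional: $X_{i,j} \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2) \quad (j = 1, \ldots, v_i)$

$$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid \mathbf{X}_i] = \left(\frac{v_i}{v_i + \sigma^2/\sigma_0^2}\right)\bar{X}_i + \left(\frac{\sigma^2/\sigma_0^2}{v_i + \sigma^2/\sigma_0^2}\right)\mu_0$$

$$= \frac{v_i}{v_i + m}\,R_i + \frac{m}{v_i + m}\,C \quad\Rightarrow\quad \mu_0 = C,\quad m = \sigma^2/\sigma_0^2$$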
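The slide states the posterior mean without the intermediate step. One standard way to fill it in, sketched here using the usual conjugate-normal result that precisions add, is to write the posterior mean as a precision-weighted average and then multiply numerator and denominator by $\sigma^2$:

```latex
\begin{align*}
\text{posterior precision} &= \frac{1}{\sigma_0^2} + \frac{v_i}{\sigma^2}, \\
E[\mu_i \mid \mathbf{X}_i]
  &= \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{v_i \bar{X}_i}{\sigma^2}}
          {\dfrac{1}{\sigma_0^2} + \dfrac{v_i}{\sigma^2}}
   = \frac{v_i \bar{X}_i + (\sigma^2/\sigma_0^2)\,\mu_0}{v_i + \sigma^2/\sigma_0^2}.
\end{align*}
```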


1. Does shrinking really help?

2. How much to shrink by?

$$\text{Prediction Error} = \sum_{i=1}^{n} (\hat{\mu}_i - \mu_i)^2$$


Standing Between a Bayesian and a Frequentist

▸ In 1956, Charles Stein proved the existence of an estimator better than the sample mean under certain assumptions.

▸ In 1961, Willard James and Charles Stein explicitly constructed such an estimator.


The James-Stein Estimator (n ≥ 4)

$$\mu_i \sim N(\mu_0, \sigma_0^2), \qquad X_i \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2) \quad (i = 1, \ldots, n)$$

$$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid X_i] = \underbrace{\left(\frac{\sigma^2}{\sigma_0^2 + \sigma^2}\right)}_{\alpha}\mu_0 + \underbrace{\left(\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\right)}_{1 - \alpha} X_i$$

$$\hat{\mu}_i^{(\text{JS})} = \underbrace{\left(\frac{(n - 3)\,\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{\alpha}\bar{X} + \underbrace{\left(1 - \frac{(n - 3)\,\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{1 - \alpha} X_i$$

In practice, if $\sigma^2$ is unknown, an estimate is used.
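A small simulation gives a feel for whether shrinking helps under exactly this model. The sketch below is illustrative only: the function names, the sample sizes, and the choice $\mu_0 = 0$, $\sigma_0 = \sigma = 1$ are assumptions made for the example, and $\sigma^2$ is treated as known.

```python
import random
import statistics


def james_stein(x, sigma2):
    """Shrink each x[i] toward the grand mean using the (n - 3) formula."""
    n = len(x)
    if n < 4:
        raise ValueError("this form of the estimator requires n >= 4")
    xbar = statistics.fmean(x)
    ss = sum((xi - xbar) ** 2 for xi in x)
    alpha = (n - 3) * sigma2 / ss  # weight placed on the grand mean
    return [alpha * xbar + (1 - alpha) * xi for xi in x]


def squared_error(estimates, truth):
    """Prediction error: sum of squared differences from the true means."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))


def simulate(n=50, reps=200, mu0=0.0, sigma0=1.0, sigma=1.0, seed=0):
    """Average prediction error of X_i alone versus the James-Stein estimate."""
    rng = random.Random(seed)
    mle_total = js_total = 0.0
    for _ in range(reps):
        mu = [rng.gauss(mu0, sigma0) for _ in range(n)]  # draw true means
        x = [rng.gauss(m, sigma) for m in mu]            # one observation each
        mle_total += squared_error(x, mu)
        js_total += squared_error(james_stein(x, sigma ** 2), mu)
    return mle_total / reps, js_total / reps


mle_err, js_err = simulate()
print(f"mean prediction error, X_i alone:   {mle_err:.2f}")
print(f"mean prediction error, James-Stein: {js_err:.2f}")
```

The comparison is averaged over many replications because shrinkage improves the total error on average, not for every individual $\mu_i$.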


Predicting Batting Averages

2nd Goal: Predict final batting averages from pre-season performances.

Pre-season batting averages for 18 major league players are provided.

Season final batting averages for the same players are also recorded.

Data is from the 1970 season and was published by Efron and Morris in JASA (1975) and Scientific American (1977).

The frequentist approach uses only $X_i$, the pre-season batting average for player i.

$$\hat{p}_i^{(\text{Fisher})} = X_i$$

The Empirical Bayes approach shrinks $X_i$ toward $\bar{X}$ by some empirically determined amount.

$$\hat{p}_i^{(\text{Stein})} = \alpha X_i + (1 - \alpha)\bar{X} \quad \text{where } \alpha \in (0, 1)$$


#   Name        hits/AB  pre-season ($\hat{\mu}^{(\text{ML})}$)  season final ($\mu$)
1   Clemente    18/45    0.400       0.346
2   Robinson    17/45    0.378       0.298
3   Howard      16/45    0.356       0.276
4   Johnstone   15/45    0.333       0.222
5   Berry       14/45    0.311       0.273
6   Spencer     14/45    0.311       0.270
7   Kessinger   13/45    0.289       0.263
8   Alvarado    12/45    0.267       0.210
9   Santo       11/45    0.244       0.269
10  Swoboda     11/45    0.244       0.230
11  Unser       10/45    0.222       0.264
12  Williams    10/45    0.222       0.256
13  Scott       10/45    0.222       0.303
14  Petrocelli  10/45    0.222       0.264
15  Rodriguez   10/45    0.222       0.226
16  Campaneris  9/45     0.200       0.286
17  Munson      8/45     0.178       0.316
18  Alvis       7/45     0.156       0.200
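The table is small enough to run the shrinkage by hand. The sketch below applies the $(n - 3)$ shrinkage formula to the 18 pre-season averages and compares prediction errors against the season-final column. One loud assumption: the slides do not say how $\sigma^2$ was estimated, so the code plugs in the binomial variance $\bar{X}(1 - \bar{X})/45$; Efron and Morris's own analysis differs in its details.

```python
import statistics

AT_BATS = 45

# name: (hits in 45 pre-season at-bats, season-final average), from the table.
players = {
    "Clemente": (18, 0.346), "Robinson": (17, 0.298), "Howard": (16, 0.276),
    "Johnstone": (15, 0.222), "Berry": (14, 0.273), "Spencer": (14, 0.270),
    "Kessinger": (13, 0.263), "Alvarado": (12, 0.210), "Santo": (11, 0.269),
    "Swoboda": (11, 0.230), "Unser": (10, 0.264), "Williams": (10, 0.256),
    "Scott": (10, 0.303), "Petrocelli": (10, 0.264), "Rodriguez": (10, 0.226),
    "Campaneris": (9, 0.286), "Munson": (8, 0.316), "Alvis": (7, 0.200),
}

x = [hits / AT_BATS for hits, _ in players.values()]
final = [avg for _, avg in players.values()]
n = len(x)
xbar = statistics.fmean(x)

# ASSUMPTION (not from the slides): binomial variance at the grand mean.
sigma2 = xbar * (1 - xbar) / AT_BATS

# Weight on the grand mean; each X_i keeps weight (1 - shrink).
shrink = (n - 3) * sigma2 / sum((xi - xbar) ** 2 for xi in x)
js = [shrink * xbar + (1 - shrink) * xi for xi in x]


def squared_error(estimates, truth):
    """Sum of squared differences between estimates and season-final values."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))


mle_err = squared_error(x, final)
js_err = squared_error(js, final)
print(f"weight on grand mean: {shrink:.3f}")
print(f"prediction error, pre-season averages: {mle_err:.4f}")
print(f"prediction error, shrunken estimates:  {js_err:.4f}")
```

Under this variance assumption the weight on the grand mean comes out close to 0.79, so each player's estimate keeps only about a fifth of its distance from $\bar{X}$.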


Batting Average Dataset

[Figure: 1977 Batting Averages Dataset (Efron). Batting average (0.0 to 0.4) for players 1 to 18; series: pre-season and season final.]


James-Stein Estimation of Batting Averages

[Figure: 1977 Batting Averages Dataset (Efron). Batting average (0.0 to 0.4) for players 1 to 18; series: pre-season and season final, plus one dash marker per player.]


Ranking Bias: Empirical Bayes + Order Statistics

▸ Genome-wide association studies

▸ SNPs: AA/Aa/aa or 0/1/2 ($\sim 10^7$)

▸ Estimated effects of the top SNPs are biased upward (winner's curse).

[Figure: 1977 Batting Averages Dataset (Efron). Batting average (0.0 to 0.4) for players 1 to 18; series: pre-season and season final.]

▸ Ranking bias estimator: part frequentist, part Bayesian, with robust properties

▸ Applied to 2 GWAS studies with 2,000 cases and 3,000 controls: Crohn's Disease and Type 1 Diabetes
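The winner's curse is easy to reproduce in a toy setting. The sketch below is not GWAS data and not the ranking bias estimator from the talk; every parameter (number of effects, effect and noise standard deviations, the function name `winners_curse`) is a made-up assumption. It draws many small true effects, adds noise, selects the largest estimates, and compares their average with the average true effect of the same selected items.

```python
import random
import statistics


def winners_curse(n_effects=2000, top_k=10, effect_sd=0.1, noise_sd=1.0,
                  reps=20, seed=1):
    """Return (mean estimated effect, mean true effect) of the top-ranked items."""
    rng = random.Random(seed)
    est_means, true_means = [], []
    for _ in range(reps):
        true = [rng.gauss(0.0, effect_sd) for _ in range(n_effects)]
        est = [t + rng.gauss(0.0, noise_sd) for t in true]
        # Indices of the top_k largest estimated effects.
        top = sorted(range(n_effects), key=est.__getitem__, reverse=True)[:top_k]
        est_means.append(statistics.fmean([est[i] for i in top]))
        true_means.append(statistics.fmean([true[i] for i in top]))
    return statistics.fmean(est_means), statistics.fmean(true_means)


est_top, true_top = winners_curse()
print(f"mean estimated effect of top picks: {est_top:.2f}")
print(f"mean true effect of top picks:      {true_top:.2f}")
```

Selecting on the estimate favors items whose noise happened to be large and positive, which is why the estimated effects of the winners overstate their true effects.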


Williams College Book Survey

In the summer of 2009, Williams faculty members were asked to list three books they felt that students should read.

150 faculty members responded.

25 departments are represented.

394 different books were recommended.

The original publication dates were added (from Wikipedia and openlibrary.org).

▶ Books with unknown publication dates (13 in total) were approximated.


The Top Picks

Most Picked Authors (4+ hits)

▸ Fyodor Dostoyevsky (6)
    The Brothers Karamazov (4)
    Crime and Punishment (1)
    Notes from the Underground (1)
▸ Gabriel García Márquez (5)
    One Hundred Years of Solitude (5)
▸ Leo Tolstoy (5)
    Anna Karenina (4)
    War and Peace (1)
▸ Bill Bryson (4)
    A Short History of Nearly Everything (3)
    In a Sunburned Country (1)
▸ George Eliot (4)
    Middlemarch (4)
▸ Henry David Thoreau (4)
    Walden (4)
▸ Vladimir Nabokov (4)
    Speak, Memory (3)
    Lolita (1)

Most Picked Titles (3+ hits)

▸ One Hundred Years of Solitude (5)
▸ Anna Karenina (4)
▸ Middlemarch (4)
▸ The Brothers Karamazov (4)
▸ Walden (4)
▸ Independent People (3)
▸ Speak, Memory (3)
▸ The Death and Life of Great American Cities (3)
▸ The Things They Carried (3)


Average Publication Year Predictions

▸ Let $\mu_i$ represent the average publication year for department i.

▸ Let $X_i$ be the average publication year for department i based on only the first book selected.

3rd Goal: Estimate µi with only Xi.


Observed Data: First Book (Red), “Truth”: All Books (Gray)

[Figure: publication year (vertical axis, 1200 to 2000) by department. Departments along the horizontal axis: Classics, Asian Stud, Anth & Soc, Religion, Humanities, Political Sci, Philosophy, Geosciences, Music, Math & Stat, English, Art, Astronomy, Comp Sci, Psychology, History, Theater, Ger & Rus, Biology, Economics, Amer Stud, Physics, Comp Lit, Chemistry, Rom. Lang. The numeric labels printed on the plot cannot be matched to departments in this transcript.]


Results

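$$\mu_i \sim N(\mu_0, \sigma_0^2), \qquad X_i \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma_i^2) \quad (i = 1, \ldots, 25)$$

Set

$$\sigma_i^2 = \frac{\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X})^2}{n_i}$$

where $n_i$ = the number of observed books in department i.

1. $\hat{\mu}_i^{(1)} = X_i$

2. $\hat{\mu}_i^{(2)} = \alpha_i X_i + (1 - \alpha_i)\bar{X}$

3. $\hat{\mu}_i^{(3)} = \alpha_i X_i + (1 - \alpha_i)\tilde{X}$, where $\tilde{X}$ denotes the median of the X's.

Prediction Error: $pe_j = \sum_{i=1}^{25} (\hat{\mu}_i^{(j)} - \mu_i)^2$

$$\frac{pe_2}{pe_1} = .583, \qquad \frac{pe_3}{pe_1} = .543$$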


James-Stein Shrinkage Toward the Median “Unequal Variances Case”

[Figure: publication year (vertical axis, 1200 to 2000) by department for the 25 departments, with dot and dash markers added for each department. The numeric labels printed on the plot cannot be matched to departments in this transcript.]


4th Goal: Investigate how the departments cluster based on the book survey.

Departments are classified into the following groups:

Natural Sciences: Astronomy, Biology, Chemistry, Geosciences, Physics

Social Sciences: American Studies, Anthropology & Sociology, Asian Studies, Economics, History, Political Science, Psychology

Formal Sciences: Computer Science, Mathematics & Statistics

Humanities: Art, Classics, Comparative Literature, English, German & Russian, Humanities, Music, Philosophy, Religion, Romance Languages, Theater


Departments Ranked by Publication Year

[Figure: publication year (vertical axis, 1400 to 2000) by department, in ranked order: Philosophy, Anth & Soc, Classics, Asian Stud, Political Sci, Religion, Math & Stat, Humanities, Astronomy, Geosciences, English, Economics, Comp Sci, Ger & Rus, Music, Art, Amer Stud, History, Psychology, Comp Lit, Theater, Rom. Lang, Physics, Biology, Chemistry. The numeric labels printed on the plot cannot be matched to departments in this transcript.]


Distance Measures

▸ Author/Title data: Jaccard distance $= 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$

▸ Year data: absolute value of the two-sample t-statistic (non-metric distance measure)

Homework

Prove the Jaccard distance is a proper metric.
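The Jaccard distance is a one-liner on Python sets. In the sketch below, the three "departments" and their picks are hypothetical, invented only to exercise the function, and returning 0 for two empty sets is a convention chosen here rather than anything from the talk. The triangle-inequality loop is a spot check on these three sets, not a proof.

```python
from itertools import combinations, permutations


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as distance 0."""
    union = a | b
    if not union:
        return 0.0  # convention (assumption): identical empty sets
    return 1 - len(a & b) / len(union)


# Hypothetical author picks for three made-up departments.
picks = {
    "Dept A": {"Tolstoy", "Eliot", "Nabokov"},
    "Dept B": {"Tolstoy", "Thoreau"},
    "Dept C": {"Bryson", "Thoreau"},
}

for (name1, set1), (name2, set2) in combinations(picks.items(), 2):
    print(f"d({name1}, {name2}) = {jaccard_distance(set1, set2):.3f}")

# Spot-check d(x, z) <= d(x, y) + d(y, z) on every ordering of the three sets.
triangle_ok = all(
    jaccard_distance(picks[x], picks[z])
    <= jaccard_distance(picks[x], picks[y]) + jaccard_distance(picks[y], picks[z]) + 1e-12
    for x, y, z in permutations(picks, 3)
)
print("triangle inequality holds on this example:", triangle_ok)
```

A pairwise matrix of these distances is the kind of input a multidimensional scaling or hierarchical clustering routine would take.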


Dendrogram of the Year Distances (Philosophy Removed)

[Figure: dendrogram with height axis from 0.0 to 2.0. Leaf order: Chemistry, Anth & Soc, Math & Stat, Religion, Political Sci, Asian Stud, Classics, Economics, English, Humanities, Astronomy, Geosciences, Art, Music, Comp Sci, Ger & Rus, Psychology, Amer Stud, History, Biology, Physics, Comp Lit, Rom. Lang, Theater.]


Multidimensional Scaling of Author Distances

[Figure: two-dimensional multidimensional scaling plot of the 25 departments, each labeled by name; horizontal axis roughly −0.5 to 0.5, vertical axis roughly −0.6 to 0.6.]


Summary

▸ There are often multiple statistical approaches to a single problem.

▸ The complete statistician makes use of all available tools.

▸ When reporting the mean values of several related quantities, think about shrinkage!


Thank You!!

Williams.ArthurBerg.com

[email protected]
