Arthur Berg Pennsylvania State University · L Anna Karenina(4) L Middlemarch(4) L The Brothers...

Standing Between a Bayesian and a Frequentist: An Empirical Bayes Exploration of Movies, Baseball, and Williams College

Arthur Berg, Pennsylvania State University

Introduction Bayes Estimation Empirical Bayes Books Books Summary

Bayesian and Frequentist Representatives

Rev. Thomas Bayes FRS (1702-1761)
English Mathematician, Presbyterian Minister

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Sir Ronald Fisher FRS (1890-1962)
English Statistician, Evolutionary Biologist, Geneticist

“Let the data speak for itself.”



Bayes Estimator as a Convex Combination

1st Goal: List the top 250 movies of all time.

Movies are rated on a scale of 1 to 10.

Some movies are rated by many people, and some by only a few.

Movies with fewer than 3000 votes are not considered.

The mean rating across all movies is C = 6.9.

⋆ $\mu_i$ represents the mean rating by everyone who has seen movie i.

⋆ The real goal is to construct the best estimate of $\mu_i$, then pick the top 250.

The frequentist approach uses only $\bar{X}_i$, the average rating for movie i.

$$\hat{\mu}_i^{(\text{Fisher})} = \bar{X}_i$$

The Bayesian approach shrinks $\bar{X}_i$ toward C, with more shrinkage applied when the number of votes for movie i is small.

$$\hat{\mu}_i^{(\text{Bayes})} = \alpha_i \bar{X}_i + (1 - \alpha_i)\,C, \quad \alpha_i \in (0,1)$$



Internet Movie Database: Top 250

Rank  WR   R    Title                                      Votes
1     9.2  9.2  The Shawshank Redemption (1994)            546,155
2     9.1  9.2  The Godfather (1972)                       427,961
3     9.0  9.0  The Godfather: Part II (1974)              257,643
4     8.9  9.0  The Good, the Bad and the Ugly (1966)      170,045
5     8.9  9.0  Pulp Fiction (1994)                        436,456
6     8.9  8.9  Inception (2010)                           265,531
7     8.9  8.9  Schindler’s List (1993)                    289,170
8     8.9  8.9  12 Angry Men (1957)                        126,983
9     8.8  8.9  One Flew Over the Cuckoo’s Nest (1975)     225,419
10    8.8  8.9  The Dark Knight (2008)                     487,800
...
85    8.5  8.7  Black Swan (2010)                          20,326
...
142   8.2  8.3  Avatar (2009)                              285,005
...
240   8.0  8.5  True Grit (2010)                           6,444



IMDb Weighted Ranking—“a true Bayesian estimate”

$$\mathrm{WR}_i = \frac{v_i R_i + m C}{v_i + m} = \underbrace{\frac{v_i}{v_i + m}}_{\alpha_i}\,\underbrace{R_i}_{\bar{X}_i} + \underbrace{\frac{m}{v_i + m}}_{1 - \alpha_i}\,C$$

▸ $R_i$ = average rating of movie i ($\bar{X}_i$)

▸ vi = total number of votes from regular voters

▸ m = minimum # of votes to make the list = 3000

▸ C = grand mean across all movies in the database = 6.9
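The weighted-rating formula can be checked against rows of the Top 250 table. The following is a minimal sketch, assuming m = 3000 and C = 6.9 as stated on the slide; the function name `weighted_rating` is a hypothetical choice, and because the table lists R rounded to one decimal, the recomputed values only agree with the listed WR after rounding.

```python
def weighted_rating(r, v, m=3000, c=6.9):
    """Shrink a movie's average rating r (based on v votes) toward the grand mean c."""
    alpha = v / (v + m)  # weight on the movie's own average; grows with the vote count
    return alpha * r + (1 - alpha) * c


# Rows copied from the Top 250 table: (title, R, votes, listed WR)
rows = [
    ("The Shawshank Redemption", 9.2, 546155, 9.2),
    ("Black Swan", 8.7, 20326, 8.5),
    ("True Grit", 8.5, 6444, 8.0),
]

for title, r, v, listed in rows:
    print(f"{title}: WR = {weighted_rating(r, v):.2f} (listed {listed})")
```

True Grit has few votes, so its rating is pulled noticeably toward 6.9, whereas a heavily voted film is left almost unchanged.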



A Bayesian Calculation

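$X_i = (X_{i,1}, \ldots, X_{i,v_i})$ represents the $v_i$ ratings of movie i.

prior: $\mu_i \sim N(\mu_0, \sigma_0^2)$

conditional: $X_{i,j} \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2) \quad (j = 1, \ldots, v_i)$

$$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid X_i] = \left(\frac{v_i}{v_i + \sigma^2/\sigma_0^2}\right)\bar{X}_i + \left(\frac{\sigma^2/\sigma_0^2}{v_i + \sigma^2/\sigma_0^2}\right)\mu_0$$

$$= \frac{v_i}{v_i + m} R_i + \frac{m}{v_i + m} C \quad\Rightarrow\quad \mu_0 = C,\ m = \sigma^2/\sigma_0^2$$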
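As a sanity check on the closed form, the posterior mean can also be computed by brute-force numerical integration over a grid. This is only a sketch: the ratings, prior variance, and rating variance below are hypothetical numbers chosen for illustration, and the function names are not from the talk.

```python
import math


def posterior_mean_closed_form(xs, mu0, sigma0_sq, sigma_sq):
    """Convex combination of the sample mean and the prior mean."""
    v = len(xs)
    xbar = sum(xs) / v
    m = sigma_sq / sigma0_sq  # plays the role of m in the IMDb formula
    return (v / (v + m)) * xbar + (m / (v + m)) * mu0


def posterior_mean_numeric(xs, mu0, sigma0_sq, sigma_sq, lo=0.0, hi=12.0, steps=12000):
    """Riemann-sum approximation of E[mu | xs] under the normal-normal model."""
    num = den = 0.0
    for k in range(steps + 1):
        mu = lo + (hi - lo) * k / steps
        log_w = -((mu - mu0) ** 2) / (2 * sigma0_sq)              # log prior (up to a constant)
        log_w -= sum((x - mu) ** 2 for x in xs) / (2 * sigma_sq)  # log likelihood (up to a constant)
        w = math.exp(log_w)
        num += mu * w
        den += w
    return num / den


# Hypothetical inputs: five ratings, prior N(6.9, 1), rating variance 4.
ratings = [9, 8, 10, 7, 9]
closed = posterior_mean_closed_form(ratings, mu0=6.9, sigma0_sq=1.0, sigma_sq=4.0)
numeric = posterior_mean_numeric(ratings, mu0=6.9, sigma0_sq=1.0, sigma_sq=4.0)
print(closed, numeric)
```

With these inputs m = 4 and v = 5, so the sample mean receives weight 5/9 and the prior mean 4/9.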


1. Does shrinking really help?

2. How much should we shrink?





$$\text{Prediction Error} = \sum_{i=1}^{n} (\hat{\mu}_i - \mu_i)^2$$


Standing Between a Bayesian and a Frequentist

▸ In 1956, Charles Stein proved the existence of an estimator better than the sample mean under certain assumptions.

▸ In 1961, Willard James and Charles Stein explicitly constructed such an estimator.



The James-Stein Estimator (n ≥ 4)

$$\mu_i \sim N(\mu_0, \sigma_0^2), \qquad X_i \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2) \quad (i = 1, \ldots, n)$$

$$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid X_i] = \underbrace{\left(\frac{\sigma^2}{\sigma_0^2 + \sigma^2}\right)}_{\alpha}\mu_0 + \underbrace{\left(\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\right)}_{1 - \alpha} X_i$$

$$\hat{\mu}_i^{(\text{JS})} = \underbrace{\left(\frac{(n - 3)\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{\alpha}\bar{X} + \underbrace{\left(1 - \frac{(n - 3)\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{1 - \alpha} X_i$$

In practice, if σ2 is unknown, an estimate is used.
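The James-Stein formula above translates almost line for line into code. This is a minimal sketch assuming a known common variance `sigma_sq`; the function name and the toy inputs are hypothetical, and α here is the weight placed on the grand mean, as on this slide.

```python
def james_stein(xs, sigma_sq):
    """Shrink each observation toward the grand mean, per the formula on the slide.

    Returns (estimates, alpha), where alpha is the weight on the grand mean.
    """
    n = len(xs)
    if n < 4:
        raise ValueError("the slide's estimator requires n >= 4")
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    if ss == 0:
        # All observations coincide, so there is nothing to shrink.
        return list(xs), 0.0
    alpha = (n - 3) * sigma_sq / ss
    return [alpha * xbar + (1 - alpha) * x for x in xs], alpha


# Toy input (hypothetical): five observations with unit variance.
estimates, alpha = james_stein([1.0, 2.0, 3.0, 4.0, 5.0], sigma_sq=1.0)
print(alpha, estimates)
```

Note that when the observations are tightly clustered relative to `sigma_sq`, α can exceed 1; the sketch follows the slide's formula literally and does not guard against that.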



Predicting Batting Averages

2nd Goal: Predict final batting averages from pre-season performances.

Pre-season batting averages for 18 major league players are provided.

Season final batting averages for the same players are also recorded.

The data are from the 1970 season and were published by Efron and Morris in JASA (1975) and Scientific American (1977).

The frequentist approach uses only $X_i$, the pre-season batting average for player i.

$$\hat{p}_i^{(\text{Fisher})} = X_i$$

The Empirical Bayes approach shrinks $X_i$ toward $\bar{X}$ by some empirically determined amount.

$$\hat{p}_i^{(\text{Stein})} = \alpha X_i + (1 - \alpha)\bar{X}, \quad \alpha \in (0,1)$$

Here $\alpha$ is the weight on $X_i$.



#   Name        hits/AB  pre-season ($\hat{\mu}^{(\text{ML})}$)  season final ($\mu$)
1   Clemente    18/45    0.400       0.346
2   Robinson    17/45    0.378       0.298
3   Howard      16/45    0.356       0.276
4   Johnstone   15/45    0.333       0.222
5   Berry       14/45    0.311       0.273
6   Spencer     14/45    0.311       0.270
7   Kessinger   13/45    0.289       0.263
8   Alvarado    12/45    0.267       0.210
9   Santo       11/45    0.244       0.269
10  Swoboda     11/45    0.244       0.230
11  Unser       10/45    0.222       0.264
12  Williams    10/45    0.222       0.256
13  Scott       10/45    0.222       0.303
14  Petrocelli  10/45    0.222       0.264
15  Rodriguez   10/45    0.222       0.226
16  Campaneris  9/45     0.200       0.286
17  Munson      8/45     0.178       0.316
18  Alvis       7/45     0.156       0.200
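The table is small enough to run the shrinkage by hand. The sketch below applies the (n − 3) James-Stein formula to the 18 pre-season averages and compares the summed squared prediction errors. One assumption is mine, not the talk's: σ² is estimated by the binomial variance at the grand mean, X̄(1 − X̄)/45. Efron and Morris worked on a transformed scale, so their exact figures may differ.

```python
# Hits in the first 45 at-bats and season-final averages, copied from the table.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
final = [0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
         0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.286, 0.316, 0.200]
at_bats = 45

n = len(hits)
x = [h / at_bats for h in hits]           # pre-season averages X_i
xbar = sum(x) / n                         # grand mean of the pre-season averages
sigma_sq = xbar * (1 - xbar) / at_bats    # ASSUMPTION: binomial variance at the grand mean
ss = sum((xi - xbar) ** 2 for xi in x)

weight_on_mean = (n - 3) * sigma_sq / ss  # James-Stein weight on the grand mean
js = [weight_on_mean * xbar + (1 - weight_on_mean) * xi for xi in x]


def prediction_error(estimates, truth):
    """Sum of squared differences between estimates and season-final averages."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))


pe_mle = prediction_error(x, final)
pe_js = prediction_error(js, final)
print(f"weight on grand mean: {weight_on_mean:.3f}")
print(f"prediction error, raw averages: {pe_mle:.4f}")
print(f"prediction error, James-Stein:  {pe_js:.4f}")
```

Under this variance estimate, roughly four fifths of each player's deviation from the grand mean is shrunk away, and the summed squared error drops to well under half of the raw averages' error.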



Batting Average Dataset

[Figure: 1977 Batting Averages Dataset (Efron). x-axis: player 1 to 18; y-axis: batting average, 0.0 to 0.4; series: pre-season and season final.]



James-Stein Estimation of Batting Averages

[Figure: 1977 Batting Averages Dataset (Efron). x-axis: player 1 to 18; y-axis: batting average, 0.0 to 0.4; one dash mark overlaid per player.]



Ranking Bias: Empirical Bayes + Order Statistics

▸ Genome-wide association studies

▸ SNPs: AA/Aa/aa or 0/1/2 (∼ $10^7$)

▸ Estimated effects of the top SNPs are biased upward (the winner’s curse).
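The upward bias of the top-ranked effects is easy to reproduce in simulation. The sketch below is purely illustrative: the effect sizes are simulated standard-normal draws rather than GWAS data, the noise level, number of effects, and seed are arbitrary assumptions, and it does not implement the ranking bias estimator discussed in the talk.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only

n_effects, top_k = 10_000, 10

# Simulated "true" effects and noisy estimates of them (both hypothetical).
true_effects = [random.gauss(0.0, 1.0) for _ in range(n_effects)]
observed = [t + random.gauss(0.0, 1.0) for t in true_effects]

# Select the top_k effects by their *observed* size, as a ranking would.
top = sorted(range(n_effects), key=lambda i: observed[i], reverse=True)[:top_k]

mean_observed = sum(observed[i] for i in top) / top_k
mean_true = sum(true_effects[i] for i in top) / top_k
bias = mean_observed - mean_true

print(f"mean observed effect of the top {top_k}: {mean_observed:.2f}")
print(f"mean true effect of the same {top_k}:    {mean_true:.2f}")
print(f"selection bias: {bias:.2f}")
```

Selecting on the observed values favors effects whose noise happened to be positive, so the winners' observed effects overstate their true effects.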

[Figure: 1977 Batting Averages Dataset (Efron). x-axis: player 1 to 18; y-axis: batting average, 0.0 to 0.4.]

▸ Ranking bias estimator: part frequentist, part Bayesian, with robust properties

▸ Applied to two GWAS with 2,000 cases and 3,000 controls:
   Crohn’s Disease
   Type 1 Diabetes



Williams College Book Survey

In the summer of 2009, Williams faculty members were asked to list three books they felt that students should read.

150 faculty members responded.

25 departments are represented.

394 different books were recommended.

The original publication dates were added (Wikipedia/openlibrary.org).

▶ Books with unknown publication dates (13 in total) were approximated.



The Top Picks

Most Picked Authors (4+ hits)

▸ Fyodor Dostoyevsky (6)
   The Brothers Karamazov (4)
   Crime and Punishment (1)
   Notes from the Underground (1)

▸ Gabriel García Márquez (5)
   One Hundred Years of Solitude (5)

▸ Leo Tolstoy (5)
   Anna Karenina (4)
   War and Peace (1)

▸ Bill Bryson (4)
   A Short History of Nearly Everything (3)
   In a Sunburned Country (1)

▸ George Eliot (4)
   Middlemarch (4)

▸ Henry David Thoreau (4)
   Walden (4)

▸ Vladimir Nabokov (4)
   Speak, Memory (3)
   Lolita (1)

Most Picked Titles (3+ hits)

▸ One Hundred Years of Solitude (5)

▸ Anna Karenina (4)

▸ Middlemarch (4)

▸ The Brothers Karamazov (4)

▸ Walden (4)

▸ Independent People (3)

▸ Speak, Memory (3)

▸ The Death and Life of Great American Cities (3)

▸ The Things They Carried (3)



Average Publication Year Predictions

▸ Let µi represent average publication year for department i.

▸ Let Xi be the average publication year for department i based on only the first book selected.

3rd Goal: Estimate µi with only Xi.



Observed Data: First Book (Red), “Truth”: All Books (Gray)

[Figure: publication year (y-axis, 1200 to 2000) by department (x-axis), with a numeric label per department. Department order on the x-axis: Classics, Asian Stud, Anth & Soc, Religion, Humanities, Political Sci, Philosophy, Geosciences, Music, Math & Stat, English, Art, Astronomy, Comp Sci, Psychology, History, Theater, Ger & Rus, Biology, Economics, Amer Stud, Physics, Comp Lit, Chemistry, Rom. Lang.]



Results

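$$\mu_i \sim N(\mu_0, \sigma_0^2), \qquad X_i \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma_i^2) \quad (i = 1, \ldots, 25)$$

Set

$$\sigma_i^2 = \frac{\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X})^2}{n_i}$$

where $n_i$ = the number of observed books in department i.

1. $\hat{\mu}_i^{(1)} = X_i$

2. $\hat{\mu}_i^{(2)} = \alpha_i X_i + (1 - \alpha_i)\bar{X}$

3. $\hat{\mu}_i^{(3)} = \alpha_i X_i + (1 - \alpha_i)\tilde{X}$, where $\tilde{X}$ denotes the median of the $X_i$'s.

$$\text{Prediction Error}_j = \sum_{i=1}^{25} (\hat{\mu}_i^{(j)} - \mu_i)^2$$

$$\frac{pe_2}{pe_1} = .583, \qquad \frac{pe_3}{pe_1} = .543$$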
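Estimators 2 and 3 differ only in the shrinkage target. The sketch below shows one way they could be computed, and it rests on assumptions: the slides do not say how α_i was obtained, so the code uses the posterior weight implied by the stated normal model, α_i = σ0² / (σ0² + σ_i²), with the prior variance `sigma0_sq` supplied as a hypothetical input. The years and book counts are made-up toy values, not the survey data.

```python
from statistics import mean, median


def shrink(xs, ns, sigma0_sq, target):
    """Shrink each X_i toward `target` with department-specific weights alpha_i.

    ASSUMPTION: alpha_i = sigma0_sq / (sigma0_sq + sigma_i_sq), the weight on X_i
    under a normal prior; sigma_i_sq follows the slide: ((1/n) * sum (X_j - Xbar)^2) / n_i.
    """
    n = len(xs)
    xbar = mean(xs)
    s_sq = sum((x - xbar) ** 2 for x in xs) / n
    estimates = []
    for x, n_i in zip(xs, ns):
        sigma_i_sq = s_sq / n_i
        alpha_i = sigma0_sq / (sigma0_sq + sigma_i_sq)
        estimates.append(alpha_i * x + (1 - alpha_i) * target)
    return estimates


# Toy, made-up inputs: first-book average years and book counts for four departments.
first_book_year = [1950.0, 1800.0, 1990.0, 1600.0]
books_observed = [3, 5, 2, 4]
prior_var = 100.0 ** 2  # hypothetical prior variance (years squared)

toward_mean = shrink(first_book_year, books_observed, prior_var, mean(first_book_year))
toward_median = shrink(first_book_year, books_observed, prior_var, median(first_book_year))
print(toward_mean)
print(toward_median)
```

Departments with fewer observed books get a larger σ_i² and are therefore pulled harder toward the target; using the median as the target reduces the influence of departments with unusually old books.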



James-Stein Shrinkage Toward the Median “Unequal Variances Case”

[Figure: publication year (y-axis, 1200 to 2000) by department (x-axis), with a numeric label, a dot, and a dash mark per department. Department order on the x-axis: Classics, Asian Stud, Anth & Soc, Religion, Humanities, Political Sci, Philosophy, Geosciences, Music, Math & Stat, English, Art, Astronomy, Comp Sci, Psychology, History, Theater, Ger & Rus, Biology, Economics, Amer Stud, Physics, Comp Lit, Chemistry, Rom. Lang.]



4th Goal: Investigate how the departments cluster based on the book survey.

Departments are classified into the following groups:

Natural Sciences: Astronomy, Biology, Chemistry, Geosciences, Physics

Social Sciences: American Studies, Anthropology & Sociology, Asian Studies, Economics, History, Political Science, Psychology

Formal Sciences: Computer Science, Mathematics & Statistics

Humanities: Art, Classics, Comparative Literature, English, German & Russian, Humanities, Music, Philosophy, Religion, Romance Languages, Theater



Departments Ranked by Publication Year

[Figure: publication year (y-axis, 1400 to 2000) by department (x-axis), with a numeric label per department. Department order on the x-axis: Philosophy, Anth & Soc, Classics, Asian Stud, Political Sci, Religion, Math & Stat, Humanities, Astronomy, Geosciences, English, Economics, Comp Sci, Ger & Rus, Music, Art, Amer Stud, History, Psychology, Comp Lit, Theater, Rom. Lang, Physics, Biology, Chemistry.]



Distance Measures

▸ Author/Title data: Jaccard distance $= 1 - \dfrac{|A \cap B|}{|A \cup B|} = \dfrac{|A \cup B| - |A \cap B|}{|A \cup B|}$

▸ Year data: absolute value of the two-sample t-statistic (a non-metric distance measure)
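The Jaccard distance between two departments' author sets takes only a few lines. This is a minimal sketch: the function name is hypothetical, the convention for two empty sets is an assumption, and the department-to-author assignments in the example are invented for illustration (only the author names come from the survey).

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # ASSUMPTION: two empty sets are treated as identical
    return 1 - len(a & b) / len(union)


# Hypothetical author sets for two departments.
dept_a = {"Leo Tolstoy", "George Eliot"}
dept_b = {"George Eliot", "Vladimir Nabokov"}

print(jaccard_distance(dept_a, dept_b))  # one shared author out of three distinct authors
```

The distance is 0 for identical sets and 1 for disjoint sets, which is why departments that recommend no authors in common all sit at the maximum distance from one another.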

Homework

Prove the Jaccard distance is a proper metric.



Dendrogram of the Year Distances (Philosophy Removed)

[Figure: dendrogram of the 24 remaining departments; vertical axis: distance, 0.0 to 2.0. Leaf order, left to right: Chemistry, Anth & Soc, Math & Stat, Religion, Political Sci, Asian Stud, Classics, Economics, English, Humanities, Astronomy, Geosciences, Art, Music, Comp Sci, Ger & Rus, Psychology, Amer Stud, History, Biology, Physics, Comp Lit, Rom. Lang, Theater.]



Multidimensional Scaling of Author Distances

[Figure: two-dimensional multidimensional scaling plot with one labeled point per department (25 departments); horizontal axis ticks from −0.5 to 0.5, vertical axis ticks from −0.6 to 0.6.]



Summary

▸ There are often multiple statistical approaches to a single problem.

▸ The complete statistician makes use of all available tools.

▸ When reporting the mean values of several related quantities, think about shrinkage!



Thank You!!

Williams.ArthurBerg.com

[email protected]

