Page 1

Lecture 1: Basic Statistical Tools

Bruce Walsh lecture notes
Uppsala EQG 2012 course

version 28 Jan 2012

Page 2

Discrete Random Variables

A random variable (RV) is an outcome (realization) that is not a set value, but rather is drawn from some probability distribution.

A discrete RV x takes on values X_1, X_2, ..., X_k

Probability distribution: P_i = Pr(x = X_i), with P_i ≥ 0 and \sum_i P_i = 1 (probabilities are non-negative and sum to one)

Example: Binomial random variable. Let p = probability of a success. Then

Pr(k successes in n trials) = \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}

Example: Poisson random variable. Let λ = expected number of successes. Then

Pr(k successes) = \frac{\lambda^k e^{-\lambda}}{k!}
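A minimal Python sketch (assuming numpy/scipy are available; the values of n, p, λ, and k below are illustrative choices, not from the lecture) checking these two formulas against library routines:

```python
# Numerical check of the binomial and Poisson probability formulas.
# The parameter values (n, p, lam, k) are arbitrary illustrative choices.
from math import comb, exp, factorial
from scipy import stats

n, p, k = 10, 0.3, 4          # n trials, success probability p, k successes
lam = 2.5                     # Poisson mean (expected number of successes)

# Binomial: Pr(k successes in n trials) = n!/[(n-k)! k!] p^k (1-p)^(n-k)
binom_by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
print(binom_by_hand, stats.binom.pmf(k, n, p))    # should agree

# Poisson: Pr(k successes) = lam^k exp(-lam) / k!
pois_by_hand = lam**k * exp(-lam) / factorial(k)
print(pois_by_hand, stats.poisson.pmf(k, lam))    # should agree

# Probabilities are non-negative and sum to one over all possible outcomes
print(sum(stats.binom.pmf(i, n, p) for i in range(n + 1)))   # ~1.0
```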

Page 3

Continuous Random Variables

A continuous RV x can take on any possible value in some interval (or set of intervals). The probability distribution is defined by the probability density function, p(x), where

p(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p(x)\,dx = 1

Probability is the area under the curve:

\Pr(x_1 \le x \le x_2) = \int_{x_1}^{x_2} p(x)\,dx

Finally, the cdf, or cumulative probability function, is defined as cdf(z) = Pr(x < z):

\mathrm{cdf}(z) = \int_{-\infty}^{z} p(x)\,dx
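A small sketch of these defining properties, using the unit normal as an illustrative p(x) (the bounds x1, x2, z are arbitrary choices):

```python
# Properties of a continuous density, checked by numerical integration.
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(loc=0, scale=1).pdf           # p(x) >= 0 everywhere

total, _ = quad(p, -np.inf, np.inf)          # integral of p(x) over the real line
print(total)                                 # ~1.0

x1, x2 = -1.0, 2.0
prob, _ = quad(p, x1, x2)                    # Pr(x1 <= x <= x2) = area under the curve
print(prob)

# cdf(z) = Pr(x < z) = integral of p(x) from -infinity to z
z = 0.5
print(quad(p, -np.inf, z)[0], stats.norm.cdf(z))   # should agree
```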

Page 4

Example: The normal (or Gaussian) distribution

Mean µ, variance σ²

Unit normal (mean 0, variance 1)

Page 5

Mean (µ) = peak of distribution

The variance is a measure of spread about the mean. The smaller σ², the narrower the distribution about the mean.

Page 6

Joint and Conditional Probabilities

The probability for a pair (x, y) of random variables is specified by the joint probability density function, p(x, y):

\Pr(y_1 \le y \le y_2,\; x_1 \le x \le x_2) = \int_{y_1}^{y_2} \int_{x_1}^{x_2} p(x, y)\,dx\,dy

The marginal density of x, p(x):

p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy

Page 7

Joint and Conditional Probabilities

p(y | x), the conditional density of y given x:

\Pr(y_1 \le y \le y_2 \mid x) = \int_{y_1}^{y_2} p(y \mid x)\,dy

Relationships among p(x), p(x, y), and p(y | x):

p(x, y) = p(y | x) p(x), hence p(y | x) = p(x, y) / p(x)

x and y are said to be independent if p(x, y) = p(x) p(y)

Note that p(y | x) = p(y) if x and y are independent.
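A discrete sketch makes the marginal and conditional calculations transparent; the small 2×2 joint table below is hypothetical, chosen only for illustration:

```python
# Marginal and conditional probabilities from a made-up discrete joint table.
import numpy as np

# joint[i, j] = Pr(x = x_i, y = y_j); entries sum to one
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)                  # marginal of x: sum p(x, y) over y
p_y = joint.sum(axis=0)                  # marginal of y: sum p(x, y) over x
print(p_x, p_y)

# Conditional of y given x = x_0: p(y | x_0) = p(x_0, y) / p(x_0)
p_y_given_x0 = joint[0, :] / p_x[0]
print(p_y_given_x0)                      # sums to one

# Independence check: is p(x, y) = p(x) p(y) for every cell?
print(np.allclose(joint, np.outer(p_x, p_y)))   # False for this table
```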

Page 8

Bayes' Theorem

Suppose an unobservable RV takes on values b_1, ..., b_n.

Suppose that we observe the outcome A of an RV correlated with b. What can we say about b given A?

Bayes' theorem:

\Pr(b_j \mid A) = \frac{\Pr(b_j)\,\Pr(A \mid b_j)}{\Pr(A)} = \frac{\Pr(b_j)\,\Pr(A \mid b_j)}{\sum_{i=1}^{n} \Pr(b_i)\,\Pr(A \mid b_i)}

A typical application in genetics is that A is some phenotype and b indexes some underlying (but unknown) genotype.

Page 9

Example: BRCA1/2 & breast cancer

• NCI statistics:
  – 12% is the lifetime risk of breast cancer in females
  – 60% is the lifetime risk if carrying a BRCA1 or BRCA2 mutation
  – One estimate of the BRCA1 or BRCA2 allele frequency is around 2.3%
  – Question: Given a patient has breast cancer, what is the chance that she has a BRCA1 or BRCA2 mutation?

Page 10

• Here
  – Event B = has a BRCA mutation
  – Event A = has breast cancer

• Bayes: Pr(B|A) = Pr(A|B) * Pr(B) / Pr(A)
  – Pr(A) = 0.12
  – Pr(B) = 0.023
  – Pr(A|B) = 0.60
  – Hence, Pr(BRCA | breast cancer) = [0.60 * 0.023] / 0.12 = 0.115

• Hence, for the assumed BRCA frequency (2.3%), 11.5% of all patients with breast cancer have a BRCA mutation.

Page 11

Second example: Suppose height > 70. What is the probability the individual is QQ, Qq, or qq?

Genotype                      QQ     Qq     qq
Freq(genotype)                0.5    0.3    0.2
Pr(height > 70 | genotype)    0.3    0.6    0.9

Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51

Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70)
                     = 0.5*0.3 / 0.51 = 0.294
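The same calculation can be done for all three genotypes at once; a small Python sketch (assuming numpy; the prior and conditional probabilities are taken from the table above):

```python
# Bayes' theorem applied to the genotype example above.
import numpy as np

genotypes = ["QQ", "Qq", "qq"]
prior     = np.array([0.5, 0.3, 0.2])    # Freq(genotype)
like      = np.array([0.3, 0.6, 0.9])    # Pr(height > 70 | genotype)

pr_tall = np.sum(prior * like)            # Pr(height > 70) = 0.51
posterior = prior * like / pr_tall        # Pr(genotype | height > 70)

for g, p in zip(genotypes, posterior):
    print(g, round(p, 3))                 # QQ 0.294, Qq 0.353, qq 0.353
```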

Page 12

Expectations of Random Variables

The expected value, E[f(x)], of some function f of the random variable x is just the average value of that function:

E[f(x)] = \sum_i \Pr(x = X_i)\, f(X_i) \quad \text{(x discrete)}

E[f(x)] = \int_{-\infty}^{+\infty} f(x)\, p(x)\,dx \quad \text{(x continuous)}

E[x] = the (arithmetic) mean, µ, of a random variable x:

E(x) = \mu = \int_{-\infty}^{+\infty} x\, p(x)\,dx

Page 13

Expectations of Random Variables

E[(x − µ)²] = σ², the variance of x:

E\left[(x-\mu)^2\right] = \sigma^2 = \int_{-\infty}^{+\infty} (x-\mu)^2\, p(x)\,dx

More generally, the r-th moment about the mean is given by E[(x − µ)^r]:
r = 2: variance; r = 3: skew; r = 4: (scaled) kurtosis

Useful properties of expectations:

E[g(x) + f(y)] = E[g(x)] + E[f(y)]
E(cx) = c E(x)
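These expectations can be checked by direct numerical integration; a sketch for a normal with illustrative values µ = 2, σ = 3 (the numbers are arbitrary):

```python
# E[f(x)] and the variance as integrals, checked numerically.
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 3.0
p = stats.norm(loc=mu, scale=sigma).pdf

mean, _ = quad(lambda x: x * p(x), -np.inf, np.inf)             # E[x]
var,  _ = quad(lambda x: (x - mu)**2 * p(x), -np.inf, np.inf)   # E[(x - mu)^2]
print(mean, var)                                                # ~2.0, ~9.0

# Linearity of expectation: E[c x] = c E[x]
c = 5.0
scaled, _ = quad(lambda x: c * x * p(x), -np.inf, np.inf)
print(scaled, c * mean)                                         # should agree
```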

Page 14

The truncated normal

Only consider values of T or above in a normal distribution. Let π_T = Pr(z > T).

Density function = p(x | x > T):

p(z \mid z > T) = \frac{p(z)}{\Pr(z > T)} = \frac{p(z)}{\int_T^{\infty} p(z)\,dz}

Mean of the truncated distribution:

E[z \mid z > T] = \int_T^{\infty} z\, \frac{p(z)}{\pi_T}\,dz = \mu + \frac{\sigma \cdot p_T}{\pi_T}

where

p_T = (2\pi)^{-1/2} \exp\!\left[-\frac{(T-\mu)^2}{2\sigma^2}\right]

Here p_T is the height of the normal at the truncation point.

Page 15

The truncated normal

Variance of the truncated distribution:

\left[\, 1 + \frac{p_T \cdot (T-\mu)/\sigma}{\pi_T} - \left(\frac{p_T}{\pi_T}\right)^{2} \right] \sigma^2

p_T = height of the normal at T
π_T = area of the curve to the right of T
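Both formulas can be checked against scipy's truncated-normal distribution; the sketch below uses illustrative values µ = 10, σ = 2, T = 12 (not from the lecture):

```python
# Checking the truncated-normal mean and variance formulas against scipy.
import numpy as np
from scipy import stats

mu, sigma, T = 10.0, 2.0, 12.0
t = (T - mu) / sigma                        # truncation point in standard units

p_T  = stats.norm.pdf(t)                    # height of the unit normal at T
pi_T = stats.norm.sf(t)                     # Pr(z > T), area to the right of T

mean_formula = mu + sigma * p_T / pi_T
var_formula  = (1 + p_T * t / pi_T - (p_T / pi_T)**2) * sigma**2

trunc = stats.truncnorm(a=t, b=np.inf, loc=mu, scale=sigma)
print(mean_formula, trunc.mean())           # should agree
print(var_formula,  trunc.var())            # should agree
```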

Page 16

Covariances

• Cov(x, y) = E[(x − µ_x)(y − µ_y)]
• Cov(x, y) = E[x*y] − E[x]*E[y]

Cov(x, y) > 0: positive (linear) association between x & y

[Scatterplot of Y versus X showing cov(X, Y) > 0]

Page 17

Cov(x, y) < 0: negative (linear) association between x & y

[Scatterplot of Y versus X showing cov(X, Y) < 0]

Cov(x, y) = 0: no linear association between x & y

[Scatterplot of Y versus X showing cov(X, Y) = 0]

Page 18

Cov(x,y) = 0 DOES NOT imply no association

[Scatterplot of Y versus X showing a clear nonlinear association, yet cov(X, Y) = 0]
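A simulated illustration of this point (a sketch; the sample size and the choice y = x² are arbitrary):

```python
# Zero covariance does NOT imply no association.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200_000)             # symmetric about zero
y = x**2                                 # y is completely determined by x

print(np.cov(x, y)[0, 1])                # ~0: no LINEAR association
# ...yet knowing x clearly tells us about y:
print(y[np.abs(x) > 1].mean(), y[np.abs(x) < 1].mean())   # very different means
```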

Page 19

Correlation

Cov = 10 tells us nothing about the strength of an association.

What is needed is an absolute measure of association. This is provided by the correlation, r(x, y):

r(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}

r = 1 implies a perfect (positive) linear association

r = −1 implies a perfect (negative) linear association

Page 20

Useful Properties of Variances and Covariances

• Symmetry: Cov(x, y) = Cov(y, x)

• The covariance of a variable with itself is the variance: Cov(x, x) = Var(x)

• If a is a constant, then Cov(ax, y) = a Cov(x, y)

• Var(ax) = a² Var(x), since Var(ax) = Cov(ax, ax) = a² Cov(x, x) = a² Var(x)

• Cov(x + y, z) = Cov(x, z) + Cov(y, z)

Page 21

Var(x + y) = Var(x) + Var(y) + 2 Cov(x, y)

Hence, the variance of a sum equals the sum of the variances ONLY when the elements are uncorrelated.

More generally,

\mathrm{Cov}\!\left(\sum_{i=1}^{n} x_i,\; \sum_{j=1}^{m} y_j\right) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Cov}(x_i, y_j)

Question: What is Var(x − y)? (A numerical check appears in the sketch below.)
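A quick simulation check of these rules (a sketch; the simulation settings are arbitrary):

```python
# Numerical checks of the variance/covariance rules above.
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=(3, 500_000))
x = z[0]
y = 0.5 * z[0] + z[1]          # correlated with x
w = z[2]                       # independent of x and y
a = 3.0

cov = lambda u, v: np.cov(u, v)[0, 1]

print(cov(a * x, y), a * cov(x, y))                 # Cov(ax, y) = a Cov(x, y)
print(np.var(a * x, ddof=1), a**2 * np.var(x, ddof=1))   # Var(ax) = a^2 Var(x)
print(cov(x + y, w), cov(x, w) + cov(y, w))         # both ~0 here
print(np.var(x + y, ddof=1),
      np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov(x, y))
print(np.var(x - y, ddof=1),
      np.var(x, ddof=1) + np.var(y, ddof=1) - 2 * cov(x, y))   # Var(x - y)
```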

Page 22

Regressions

Consider the best (linear) predictor of y given we know x:

\hat{y} = \bar{y} + b_{y|x}\,(x - \bar{x})

The slope of this linear regression is a function of the covariance:

b_{y|x} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}

The fraction of the variation in y accounted for by knowing x, i.e., Var(ŷ)/Var(y), is r².

Page 23

Relationship between the correlation and the regression slope:

r(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = b_{y|x} \sqrt{\frac{\mathrm{Var}(x)}{\mathrm{Var}(y)}}

If Var(x) = Var(y), then b_{y|x} = b_{x|y} = r(x, y).

In this case, the fraction of variation accounted for by the regression is b².
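A simulation sketch of these relationships (the slope, intercept, and noise level used to generate the data are arbitrary):

```python
# The least-squares slope as Cov/Var and its relation to the correlation r.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 2.0 + 0.7 * x + rng.normal(scale=1.5, size=x.size)

cov_xy = np.cov(x, y)[0, 1]
var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)

b_yx = cov_xy / var_x                          # regression slope of y on x
r = cov_xy / np.sqrt(var_x * var_y)            # correlation

print(b_yx, np.polyfit(x, y, 1)[0])            # matches the least-squares fit
print(r, b_yx * np.sqrt(var_x / var_y))        # r = b_y|x * sqrt(Var(x)/Var(y))
print(r**2)                                    # fraction of Var(y) explained
```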

Page 24

[Two scatterplots with fitted regression lines, illustrating r² = 0.6 and r² = 0.9]

Page 25

Properties of Least-squares Regressions

The slope and intercept obtained by least squares minimize the sum of squared residuals:

\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2

• The average value of the residual is zero

• The LS solution maximizes the amount of variation in y that can be explained by a linear regression on x

• The residual errors around the least-squares regression are uncorrelated with the predictor variable x

• The fraction of variance in y accounted for by the regression is r²

• Homoscedastic vs. heteroscedastic residual variances
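A sketch verifying the residual properties listed above on simulated data (the generating parameters are arbitrary):

```python
# Properties of least-squares residuals, checked numerically.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50_000)
y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=x.size)

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # slope = Cov(x, y) / Var(x)
a = y.mean() - b * x.mean()                   # intercept
e = y - (a + b * x)                           # residuals

print(e.mean())                               # ~0: average residual is zero
print(np.cov(e, x)[0, 1])                     # ~0: residuals uncorrelated with x
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(1 - e.var(ddof=1) / y.var(ddof=1), r2)  # explained fraction equals r^2
```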

Page 26

Different methods of analysis

• Parameters of these various models can be estimated in a number of frameworks

• Method of moments
  – Makes very few assumptions about the underlying distribution. Typically, the mean of some statistic has an expected value equal to the parameter
  – Example: the estimate of the mean µ is given by the sample mean, x̄, since E(x̄) = µ
  – While estimation does not require distributional assumptions, confidence intervals and hypothesis testing do

• Distribution-based estimation
  – The explicit form of the distribution is used

Page 27

Distribution-based estimation

• Maximum likelihood estimation
  – MLE
  – REML
  – More in Lynch & Walsh (book) Appendix 3

• Bayesian
  – Marginal posteriors
  – Conjugating priors
  – MCMC/Gibbs sampling
  – More in Walsh & Lynch (online chapters = Vol 2) Appendices 2, 3

Page 28

Maximum Likelihood

p(x_1, ..., x_n | θ) = density of the observed data (x_1, ..., x_n) given the (unknown) distribution parameter(s) θ

Fisher suggested the method of maximum likelihood: given the data (x_1, ..., x_n), find the value(s) of θ that maximize p(x_1, ..., x_n | θ).

We usually express p(x_1, ..., x_n | θ) as a likelihood function l(θ | x_1, ..., x_n) to remind us that it depends on the observed data.

The maximum likelihood estimator (MLE) of θ is the value(s) that maximizes the likelihood function l given the observed data x_1, ..., x_n.

Page 29

[Plot of the likelihood surface l(θ | x) against θ, with the peak marking the MLE of θ]

The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator. A narrow peak = high precision; a broad peak = lower precision.

This is formalized by looking at the log-likelihood surface, L = ln[l(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes l also maximizes L.

The larger the curvature, the smaller the variance:

\mathrm{Var}(\mathrm{MLE}) = \frac{-1}{\partial^2 L(\theta \mid z)/\partial \theta^2}

evaluated at the MLE of θ.
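A small sketch of these ideas for the mean of a normal with known variance (data simulated, values arbitrary); for this model the MLE is the sample mean and the curvature-based variance is σ²/n:

```python
# MLE of a normal mean (known sigma) by maximizing the log-likelihood on a grid.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sigma, n = 2.0, 200
z = rng.normal(loc=7.0, scale=sigma, size=n)       # observed data

def logL(mu):
    """Log-likelihood L = ln l(mu | z) for a normal with known sigma."""
    return np.sum(stats.norm.logpdf(z, loc=mu, scale=sigma))

grid = np.linspace(z.mean() - 1, z.mean() + 1, 2001)
mle = grid[np.argmax([logL(m) for m in grid])]
print(mle, z.mean())                                # MLE of mu ~ sample mean

# Curvature: here d2L/dmu2 = -n/sigma^2, so Var(MLE) = sigma^2 / n
print(sigma**2 / n)
```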

Page 30

Likelihood Ratio tests

Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests:

\mathrm{LR} = -2 \ln\!\left(\frac{l(\hat{\Theta}_r \mid z)}{l(\hat{\Theta} \mid z)}\right) = -2\left[\,L(\hat{\Theta}_r \mid z) - L(\hat{\Theta} \mid z)\,\right]

For large sample sizes the LR statistic (generally) approaches a chi-square distribution with r df (r = number of parameters assigned fixed values under the null).

\hat{\Theta}_r is the MLE under the restricted conditions (some parameters specified, e.g., var = 1).

\hat{\Theta} is the MLE under the unrestricted conditions (no parameters specified).
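A minimal sketch of an LR test of the null hypothesis µ = 0 for normal data with known variance (one parameter fixed under the null, so r = 1); data and settings are illustrative:

```python
# Likelihood-ratio test of mu = 0 for a normal with known variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sigma = 1.0
z = rng.normal(loc=0.2, scale=sigma, size=50)       # observed data

logL = lambda mu: np.sum(stats.norm.logpdf(z, loc=mu, scale=sigma))

L_full       = logL(z.mean())   # unrestricted MLE: mu_hat = sample mean
L_restricted = logL(0.0)        # restricted model: mu fixed at 0

LR = -2 * (L_restricted - L_full)
p_value = stats.chi2.sf(LR, df=1)      # compare to chi-square with 1 df
print(LR, p_value)
```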

Page 31

Bayesian Statistics

An extension of likelihood is Bayesian statistics:

p(θ | x) = C * l(x | θ) p(θ)

Instead of simply producing a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x.

p(θ | x) is the posterior distribution for θ given the data x

l(x | θ) is just the likelihood function

p(θ) is the prior distribution on θ

Page 32

Bayesian Statistics

Why Bayesian?

• Exact for any sample size

• Marginal posteriors

• Efficient use of any prior information

• MCMC (such as Gibbs sampling) methods

Priors quantify the strength of any prior information. Often these are taken to be diffuse (with a high variance), so the prior weight on θ is spread over a wide range of possible values.

Page 33

Marginal posteriors

• Oftentimes we are interested in a particular set of parameters (say some subset of the fixed effects). However, we also have to estimate all of the other parameters.

• How do uncertainties in these nuisance parameters factor into the uncertainty in the parameters of interest?

• A Bayesian marginal posterior takes this into account by integrating the full posterior over the nuisance parameters.

• While this sounds complicated, it is easy to do with MCMC (Markov Chain Monte Carlo).

Page 34

Conjugating priors

For any particular likelihood, we can often find a conjugating prior, such that the product of the likelihood and the prior returns a known distribution.

Example: For the mean µ in a normal, taking the prior on the mean to also be normal returns a posterior for µ that is normal.

Example: For the variance σ² in a normal, taking the prior on the variance to be an inverse chi-square distribution returns a posterior for σ² that is also an inverse chi-square (details in WL Appendix 2).

Page 35

A normal prior on the mean, with mean µ_0 and variance σ_0² (the larger σ_0², the more diffuse the prior).

If the likelihood for the mean is a normal distribution, the resulting posterior for µ is also normal.

Note that if σ_0² is large, the mean of the posterior is very close to the sample mean.
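The slide's posterior formula is not reproduced in this transcript; the sketch below uses the standard normal-normal update with known σ² (an assumption about the intended result, stated explicitly) and illustrates the diffuse-prior claim above:

```python
# Conjugate update for a normal mean with known variance.
# Prior: mu ~ N(mu0, s0sq). Data: n draws from N(mu, sigma^2).
# (Standard textbook result; the lecture's own formula is not in the transcript.)
import numpy as np

rng = np.random.default_rng(7)
sigma, n = 2.0, 25
x = rng.normal(loc=5.0, scale=sigma, size=n)
xbar = x.mean()

def posterior(mu0, s0sq):
    """Posterior mean and variance of mu under a N(mu0, s0sq) prior."""
    post_var  = 1.0 / (1.0 / s0sq + n / sigma**2)
    post_mean = post_var * (mu0 / s0sq + n * xbar / sigma**2)
    return post_mean, post_var

print(xbar)
print(posterior(mu0=0.0, s0sq=1.0))    # informative prior pulls the mean toward mu0
print(posterior(mu0=0.0, s0sq=1e6))    # diffuse prior: posterior mean ~ sample mean
```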

Page 36

If x follows a chi-square distribution, then 1/x follows an inverse chi-square distribution.

The scaled inverse chi-square has two parameters, allowing more control over the mean and variance of the prior.

Page 37

MCMC

Analytic expressions for posteriors can be complicated, but the method of MCMC (Markov Chain Monte Carlo) is a general approach to simulating draws from just about any distribution (details in WL Appendix 3).

Generating several thousand such draws from the posterior returns an empirical distribution that we can use.

For example, we can compute a 95% credible interval, the region of the distribution containing 95% of the probability.
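The bookkeeping for a credible interval is simple once draws are in hand; in the sketch below the "MCMC output" is faked with independent draws from a known distribution purely for illustration (real MCMC samples would be used the same way):

```python
# A 95% credible interval from posterior draws.
import numpy as np

rng = np.random.default_rng(8)
posterior_draws = rng.normal(loc=3.2, scale=0.4, size=10_000)   # stand-in for MCMC output

lower, upper = np.percentile(posterior_draws, [2.5, 97.5])
print(lower, upper)                      # central 95% credible interval
print(np.mean(posterior_draws))          # posterior mean estimate
```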

Page 38

Gibbs Sampling

• A very powerful version of MCMC is the Gibbs sampler.

• Assume we are sampling from a vector of parameters, but that the distribution of each parameter, conditional on the current values of all the others, is known.

• For example, given a current value for all the fixed effects (but one, say β_1) and the variances, conditioning on these values the distribution of β_1 is a normal, whose parameters are now functions of the current values of the other parameters. A random draw is then generated from this distribution.

• Likewise, conditioning on all the fixed effects and all variances but one, the distribution of this variance is an inverse chi-square.

Page 39

This generates one cycle of the sampler. Using these new values, a second cycle is generated.

Full details in WL Appendix 3.
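Below is a minimal sketch of such a sampler for the mean and variance of a single normal sample, assuming vague priors (flat on µ, proportional to 1/σ² on the variance) so that the full conditionals take exactly the normal and scaled inverse chi-square forms described above; the data and settings are illustrative:

```python
# Minimal Gibbs sampler for the mean and variance of a normal sample.
import numpy as np

rng = np.random.default_rng(9)
z = rng.normal(loc=10.0, scale=3.0, size=100)     # observed data
n, zbar = z.size, z.mean()

n_iter = 5_000
mu, s2 = z.mean(), z.var()                         # starting values
samples = np.empty((n_iter, 2))

for it in range(n_iter):
    # 1. mu | sigma^2, z ~ Normal(zbar, sigma^2 / n)
    mu = rng.normal(zbar, np.sqrt(s2 / n))
    # 2. sigma^2 | mu, z ~ scaled inverse chi-square:
    #    draw chi-square(n), then sigma^2 = sum((z - mu)^2) / (chi-square draw)
    s2 = np.sum((z - mu) ** 2) / rng.chisquare(n)
    samples[it] = mu, s2                           # one full cycle of the sampler

burn = samples[1_000:]                             # discard burn-in draws
print(burn[:, 0].mean(), np.percentile(burn[:, 0], [2.5, 97.5]))   # posterior for mu
print(burn[:, 1].mean(), np.percentile(burn[:, 1], [2.5, 97.5]))   # posterior for sigma^2
```

Each pass through the loop is one cycle of the sampler; the retained draws form the empirical posterior used for means and credible intervals, as described on the MCMC slide.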