
Statistical Inference

Parametric Inference

Maximum Likelihood Inference

Exponential Families

Expectation Maximization (EM)

Bayesian Inference

Statistical Decision Theory


Statistical Inference

Statistics aims at retrieving the “causes” (e.g., parameters of a pdf) from the observations (effects)

Statistical inference problems can thus be seen as Inverse Problems

As a result of this perspective, in the eighteenth century (at the time of Bayes and Laplace) Statistics was often called Inverse Probability

Probability: from causes to observations; Statistics: from observations to causes


Parametric Inference

Consider a parametric model indexed by a parameter f taking values in a parameter space

The problem of inference reduces to the estimation of f from the observation g

Parameters of interest and nuisance parameters

Sometimes we are only interested in some function of f that depends only on a subset of its components. Splitting f accordingly:

- parameter of interest: the components that this function depends on;

- nuisance parameter: the remaining components

Example:


Parametric Inference (theoretical limits)

The Cramér-Rao Lower Bound (CRLB)

Under appropriate regularity conditions, the covariance matrix of any unbiased estimator is bounded below by the inverse of the Fisher information matrix

An unbiased estimator that attains the CRLB may be found iff the score factors through some function h of the data; the estimator is then given by h (see the sketch below)
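The bound itself is not written out above, so here is a sketch of the standard statement, in the f (parameter) and g (data) notation used throughout these slides; the exact notation on the original slide may differ:

```latex
% Cramér-Rao lower bound and Fisher information (standard form, reconstructed)
\[
\operatorname{cov}(\hat{f}) \;\succeq\; I(f)^{-1},
\qquad
[I(f)]_{ij} \;=\; -\,\mathbb{E}\!\left[\frac{\partial^{2}\log p(g\,|\,f)}{\partial f_i\,\partial f_j}\right].
\]
% An unbiased estimator attaining the bound exists iff the score factors as
\[
\frac{\partial \log p(g\,|\,f)}{\partial f} \;=\; I(f)\,\bigl(h(g)-f\bigr)
\quad\text{for some } h,
\qquad\text{in which case } \hat{f}=h(g).
\]
```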


CRLB for the general Gaussian case

Example: Parameter of a signal in white noise

Example: Known signal in unknown white noise

If


Maximum Likelihood Method

The probability of the observed data, viewed as a function of the parameter f, is the likelihood function

If the likelihood is strictly positive for all f, we can equivalently work with the log-likelihood
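For reference, the standard definitions behind these statements, again in the f/g notation (a reconstruction, not the slide's own rendering):

```latex
% Likelihood and maximum likelihood estimate (standard definitions)
\[
L(f) \;=\; p(g\,|\,f),
\qquad
\hat{f}_{\mathrm{ML}} \;=\; \arg\max_{f}\, L(f) \;=\; \arg\max_{f}\, \log L(f).
\]
```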

Example (Bernoulli)


Maximum Likelihood

Example (Uniform)

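The worked formulas of this example are not shown; assuming it is the usual uniform model on [0, f], the standard result is:

```latex
% ML estimation for g_1, ..., g_n IID Uniform(0, f)  (assumed form of the example)
\[
L(f) \;=\; \prod_{i=1}^{n} \frac{1}{f}\,\mathbf{1}\{0 \le g_i \le f\}
\;=\; f^{-n}\,\mathbf{1}\{\max_i g_i \le f\},
\qquad
\hat{f}_{\mathrm{ML}} \;=\; \max_i g_i .
\]
```

Note that the maximum is attained at the boundary of the indicator constraint rather than at a stationary point, so the derivative-based argument used in regular models does not apply here.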


Maximum Likelihood

Example (Gaussian)

Sample mean

Sample variance

IID
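A minimal numerical sketch of these ML estimates (illustrative code with simulated data, not taken from the slides):

```python
import numpy as np

# ML estimates for an IID Gaussian sample: the sample mean and the
# (biased, 1/n) sample variance.
rng = np.random.default_rng(0)
g = rng.normal(loc=3.0, scale=2.0, size=1000)   # hypothetical data

mu_ml = g.mean()                      # sample mean
var_ml = ((g - mu_ml) ** 2).mean()    # ML variance uses 1/n, not 1/(n-1)

print(mu_ml, var_ml)                  # close to 3.0 and 4.0
```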


Maximum Likelihood

Example (Multivariate Gaussian)

IID

Sample mean

Sample covariance


Maximum Likelihood (linear observation model)

Example: Linear observation in Gaussian noise

A is full rank


Example: Linear observation in Gaussian noise (cont.)

• The MLE is equivalent to the LSE under the appropriate norm

• If A is full rank, the MLE is given by the Moore-Penrose pseudo-inverse of A applied to the observation

• The matrix mapping the observation to its fitted value is a projection matrix (this can be seen from the SVD)

• If the noise is zero-mean but not Gaussian, the Best Linear Unbiased Estimator (BLUE) is still given by the same expression (see the sketch below)
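A minimal numerical sketch of the linear-model MLE computed through the pseudo-inverse (illustrative code; the white-noise model, sizes, and variable names are assumptions):

```python
import numpy as np

# ML estimation in the linear model g = A f + w, with w white Gaussian noise,
# reduces to least squares when A has full column rank.
rng = np.random.default_rng(1)
n, p = 50, 3
A = rng.normal(size=(n, p))                 # full-rank observation matrix
f_true = np.array([1.0, -2.0, 0.5])
g = A @ f_true + 0.1 * rng.normal(size=n)

f_ml = np.linalg.pinv(A) @ g                # Moore-Penrose pseudo-inverse
# equivalently: f_ml, *_ = np.linalg.lstsq(A, g, rcond=None)

print(f_ml)                                 # close to f_true
```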


Linear observation in Gaussian noise

Maximum likelihood

Properties (the MLE is optimal for the linear model)

The MLE:

• Is the Minimum Variance Unbiased (MVU) estimator
[it is unbiased and its covariance is the minimum among all unbiased estimators]

• Is efficient (it attains the Cramér-Rao Lower Bound (CRLB))

• Its PDF is Gaussian


Appealing properties of MLE

Maximum likelihood (characterization)

1. The MLE is consistent: it converges in probability to the true parameter as the sample size grows

2. The MLE is equivariant: the MLE of a function of the parameter is that function evaluated at the MLE of the parameter

3. The MLE (under appropriate regularity conditions) is asymptotically Normal and optimal or efficient: for a sequence of IID observation vectors, the suitably scaled estimation error converges in distribution to a zero-mean Normal whose covariance is the inverse of the Fisher information matrix


The exponential Family

Definition: a set of densities is an exponential family of dimension k if there are functions such that every member factors in the exponential-family form (see the sketch below); the statistic appearing in the exponent is then a sufficient statistic for f

Theorem (Neyman-Fisher Factorization): a statistic is sufficient for f iff the density can be factored into a term that depends on the data only through that statistic and a term that does not depend on f
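A sketch of the two standard forms referred to above, with t(g) denoting the sufficient statistic (reconstructed notation):

```latex
% Exponential family of dimension k (standard form)
\[
p(g\,|\,f) \;=\; h(g)\, c(f)\,
\exp\!\Bigl(\textstyle\sum_{i=1}^{k} w_i(f)\, t_i(g)\Bigr).
\]
% Neyman-Fisher factorization: t(g) is sufficient for f iff
\[
p(g\,|\,f) \;=\; a\bigl(t(g),\, f\bigr)\; b(g).
\]
```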


The exponential family

Natural (or canonical) form

Given an exponential family, it is always possible to introduce a change of variables and a reparameterization (the natural, or canonical, parameters) such that the density takes the canonical form sketched below

Since the result is a PDF, it must integrate to one; this normalization defines the partition function
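A sketch of the canonical (natural) form and of the normalization it implies, with θ denoting the natural parameter (standard notation, reconstructed):

```latex
% Canonical (natural) form of an exponential family and its partition function
\[
p(g\,|\,\theta) \;=\; h(g)\,\exp\!\bigl(\theta^{T} t(g) - A(\theta)\bigr),
\qquad
A(\theta) \;=\; \log \int h(g)\,\exp\!\bigl(\theta^{T} t(g)\bigr)\, dg ,
\]
% so that the density integrates to one.
```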


The exponential family (The partition function)

Computing moments from the derivatives of the partition function

After some calculus
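The calculus alluded to above yields the standard identities, stated here in terms of the canonical form and log-partition function A(θ) sketched on the previous slide (a reconstruction):

```latex
% Moments of the sufficient statistic from derivatives of the log-partition function
\[
\nabla_{\theta} A(\theta) \;=\; \mathbb{E}\bigl[t(g)\bigr],
\qquad
\nabla^{2}_{\theta} A(\theta) \;=\; \operatorname{cov}\bigl[t(g)\bigr].
\]
```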


The exponential family (IID sequences)

Let the observation density be a member of an exponential family

The density of an IID sequence of such observations is the product of the individual densities, and it

belongs to an exponential family of the same dimension, whose sufficient statistic is the sum of the per-sample statistics


Examples of exponential families

Many of the most common probabilistic models belong to exponential

families; e.g., Gaussian, Poisson, Bernoulli, binomial, exponential,

gamma, and beta.

Example:

Canonical form


Examples of exponential families (Gaussian)

Example:

Canonical form
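Assuming the example is the univariate Gaussian with unknown mean and variance, its standard canonical-form representation is sketched below (a reconstruction):

```latex
% Univariate Gaussian as a 2-dimensional exponential family (canonical form)
\[
p(g\,|\,\mu,\sigma^{2})
\;=\; \frac{1}{\sqrt{2\pi}}\,
\exp\!\bigl(\theta_{1} g + \theta_{2} g^{2} - A(\theta)\bigr),
\qquad
\theta_{1}=\frac{\mu}{\sigma^{2}},\quad
\theta_{2}=-\frac{1}{2\sigma^{2}},\quad
t(g)=(g,\,g^{2}),
\]
% with log-partition A(θ) = μ²/(2σ²) + log σ = -θ₁²/(4θ₂) - ½ log(-2θ₂).
```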


Computing maximum likelihood estimates

Very often the MLE cannot be found analytically. Commonly used numerical methods:

1. Newton-Raphson

2. Scoring

3. Expectation Maximization (EM)

Newton-Raphson method

Scoring method

In the scoring method the Hessian is replaced by its expectation, the Fisher information matrix, which can be computed off-line
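A minimal Newton-Raphson sketch (illustrative code, not the slides' own; the exponential-rate model is used only because its closed-form MLE makes the iteration easy to check):

```python
import numpy as np

# Newton-Raphson for the ML estimate of the rate of an exponential distribution.
rng = np.random.default_rng(2)
g = rng.exponential(scale=2.0, size=500)   # hypothetical data, true rate = 0.5
n, S = g.size, g.sum()

lam = 0.2                                  # starting value (kept below 2*n/S so this iteration converges)
for _ in range(20):
    score = n / lam - S                    # d/dλ of the log-likelihood n*log(λ) - λ*S
    hessian = -n / lam**2                  # d²/dλ²
    lam = lam - score / hessian            # Newton-Raphson update

# In this model the expected (Fisher) information equals -hessian, so the
# scoring method produces exactly the same iteration.
print(lam, 1.0 / g.mean())                 # the two values should agree
```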


Computing maximum likelihood estimates (EM)

Expectation Maximization (EM) [Dempster, Laird, and Rubin, 1977]

Idea: iterate between two steps:

Suppose that the likelihood of the observed data is hard to maximize,

but we can find a vector z such that the complete-data likelihood is easy to maximize

E-step: “fill in” z, by taking the expectation of the complete-data log-likelihood over z, given the observed data and the current parameter estimate

M-step: maximize the resulting expected complete-data log-likelihood

Terminology

Complete data

Missing data

Observed data


Expectation maximization

The EM algorithm

1. Pick a starting parameter estimate; repeat steps 2. and 3. until convergence

2. E-step: calculate the expected complete-data log-likelihood (the Q function), given the observed data and the current estimate

3. M-step: maximize the Q function with respect to the parameters

Alternatively (GEM): in the M-step it is enough to find a value that increases the Q function, rather than maximizing it
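The usual formal definitions behind these steps, in the g/z/f notation above with f_k the current estimate (a reconstruction):

```latex
% E-step quantity and M-step update (standard EM)
\[
Q(f;\,f_k) \;=\; \mathbb{E}\bigl[\log p(g, z \,|\, f) \;\big|\; g,\, f_k\bigr],
\qquad
f_{k+1} \;=\; \arg\max_{f}\; Q(f;\,f_k).
\]
```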


Expectation maximization

The EM (GEM) algorithm always increases the likelihood.

Define

1.

2.

3.

4.

Kullback-Leibler distance

KL distance maximization
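The numbered steps above are not filled in; a standard version of the argument (my reconstruction) is the following:

```latex
% Why EM (and GEM) never decreases the likelihood
\[
\log p(g\,|\,f) \;=\; Q(f;\,f_k) \;-\; \mathbb{E}\bigl[\log p(z\,|\,g,f)\,\big|\,g, f_k\bigr],
\]
% so, evaluating at f_{k+1} and at f_k and subtracting,
\[
\log p(g\,|\,f_{k+1}) - \log p(g\,|\,f_k)
\;=\; \underbrace{Q(f_{k+1};f_k)-Q(f_k;f_k)}_{\ge 0 \text{ (M-step)}}
\;+\; \underbrace{D_{\mathrm{KL}}\!\bigl(p(z|g,f_k)\,\|\,p(z|g,f_{k+1})\bigr)}_{\ge 0}
\;\ge\; 0 .
\]
```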


Expectation maximization (why does it work?)


EM: Mixtures of densities

Let z be the random variable that selects the active mode; the mixture density is then a weighted sum of the mode densities,

where the weights are nonnegative and sum to one


EM: Mixtures of densities

Consider now a sequence of IID random variables

Let the corresponding z variables be IID random variables, where each one selects the active

mode for its sample:


EM: Mixtures of densities

Equivalent Q

Where is the sample mean of x, i.e.,


EM: Mixtures of densities

E-step:

M-step:


EM: Mixtures of densities

E-step:

M-step:


EM: Mixtures of Gaussian densities (MOGs)

M-step:

E-step:

Weighted sample mean

Weighted sample covariance
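A compact illustrative EM loop for a 1D mixture of Gaussians, following the E-step/M-step structure above (my own sketch of the standard updates; the data-generating values are taken from the 1D example on the next slide, interpreting its columns as mean, variance, and weight, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from a 3-mode mixture (values as in the 1D example)
means_true = np.array([0.0, 3.0, 6.0])
vars_true = np.array([1.0, 3.0, 10.0])
weights_true = np.array([0.6316, 0.3158, 0.0526])
N = 1900
z = rng.choice(3, size=N, p=weights_true)
g = rng.normal(means_true[z], np.sqrt(vars_true)[z])

# EM for a K-mode Gaussian mixture
K = 3
mu, var, w = np.array([-1.0, 2.0, 8.0]), np.ones(K), np.ones(K) / K
for _ in range(200):
    # E-step: posterior probability (responsibility) of each mode for each sample
    dens = w * np.exp(-0.5 * (g[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted sample means/variances and updated weights
    Nk = resp.sum(axis=0)
    mu = (resp * g[:, None]).sum(axis=0) / Nk
    var = (resp * (g[:, None] - mu) ** 2).sum(axis=0) / Nk
    w = Nk / N

print(mu, var, w)
```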


EM: Mixtures of Gaussian densities. 1D Example

True mixture parameters (one mode per row):

0        1        0.6316
3        3        0.3158
6        10       0.0526

p = 3

N = 1900

[Figure: log-likelihood L(fk) as a function of the EM iteration, increasing from about -5200 to about -3800 over roughly 30 iterations.]

EM estimates (one mode per row):

-0.0288   1.0287   0.6258
 2.8952   2.5649   0.3107
 6.1687   7.3980   0.0635


EM: Mixtures of Gaussian Densities (MOGs)

Example – 1D (same mixture as before)

0        1        0.6316
3        3        0.3158
6        10       0.0526

p = 3

N = 1900

[Figure: data histogram with the estimated and true mixture densities (legend: hist, est MOG, true MOG), and data histogram with the estimated and true individual modes (legend: hist, est modes, true modes).]


EM: Mixtures of Gaussian Densities: 2D Example

[Figure: 2D example with k = 3 modes.]

MOG with determination of the number of modes [M. Figueiredo, 2002]


Bayesian Estimation


The Bayesian Philosophy ([Wasserman, 2004])

Bayesian Inference

B1 – Probabilities describe degrees of belief, not limiting relative frequency

B2 – We can make probability statements about parameters, even though

they are fixed parameters

B3 – We make inferences about a parameter by producing a probability distribution for it

Frequentist or Classical Inference

F1 – Probabilities refer to limiting relative frequencies and are objective properties of the real world

F2 – Parameters are fixed unknown constants

F3 – The criteria for obtaining statistical procedures are based on long-run frequency properties


The Bayesian Philosophy

Classical Inference: an unknown parameter, an observation model, and the observation

Bayesian Inference: adds prior knowledge about the unknown, in the form of a prior distribution that

describes degrees of belief (subjective), not limiting frequency


The Bayesian method

1. Choose a prior density p(f), called the prior (or a priori) distribution, that expresses our beliefs about f before we see any data

2. Choose the observation model p(g|f) that reflects our beliefs about g given f

3. Calculate the posterior (or a posteriori) distribution using the Bayes law (sketched below), where

p(g) is the marginal on g (other names: evidence, unconditional, predictive)

4. Any inference should be based on the posterior
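The Bayes law referred to in step 3, in the f/g notation of these slides:

```latex
% Bayes law and the evidence (marginal on g)
\[
p(f\,|\,g) \;=\; \frac{p(g\,|\,f)\,p(f)}{p(g)},
\qquad
p(g) \;=\; \int p(g\,|\,f)\,p(f)\,df .
\]
```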


The Bayesian method

Example: Let the observations be IID Bernoulli(f) and let the prior on f be a Beta(α, β) density

[Figure: Beta(α, β) densities for α = β = 0.5, 1, 2, and 10; for α = β > 1, the prior pulls the estimate towards 1/2.]


Example (cont.):

(Bernoulli observations, Beta prior)

Observation model

Prior

Posterior

Thus,
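A sketch of the standard conjugate update for this example (a reconstruction; s denotes the number of successes among the n Bernoulli observations):

```latex
% Bernoulli likelihood with a Beta(α, β) prior
\[
p(f\,|\,g_1,\dots,g_n) \;\propto\;
f^{\,s}(1-f)^{\,n-s}\; f^{\,\alpha-1}(1-f)^{\,\beta-1}
\;=\; f^{\,s+\alpha-1}(1-f)^{\,n-s+\beta-1},
\]
% i.e., the posterior is Beta(α + s, β + n − s).
```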


Example (cont.):

(Bernoulli observations, Beta prior)

• Total ignorance: flat prior α = β = 1

Maximum a posteriori estimate (MAP)

The von Mises Theorem

If the prior is continuous and not zero at the location of the ML estimate, then the posterior concentrates around the ML estimate as the sample size grows

• Note that, for large values of n, the Bayesian estimate therefore approaches the ML estimate


Conjugate priors

In the previous example, the prior and the posterior are both Beta

distributed. We say that the prior is conjugate with respect to the model

• Formally, consider a parametrized family of priors and a family of observation models

• The family of priors is conjugate for the observation model if the resulting posterior belongs to the same parametrized family, for some updated parameter values

• Very often, prior information about f is weak, which leaves us free to select conjugate priors

• Why conjugate priors? Computing the posterior density then simply consists in updating the parameters of the prior


Conjugate priors (Gaussian observation, Gaussian prior)

• Gaussian observations

• Gaussian prior

• The posterior distribution is Gaussian

1. The mean of the posterior is a convex combination of the observation g and the prior mean (it lies in the segment defined by them)

2. The variance of the posterior is the “parallel” (harmonic) combination of the observation variance and the prior variance


Conjugate priors (Gaussian IID observations, Gaussian prior)

• Gaussian IID observations

• Gaussian prior

• The posterior distribution is Gaussian

1. The mean of the posterior is a convex combination of the sample mean and the prior mean

2. The variance of the posterior is the “parallel” (harmonic) combination of the variance of the sample mean and the prior variance
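A sketch of the standard update for this case (a reconstruction; σ² is the observation variance, μ₀ and σ₀² the prior mean and variance, and ḡ the sample mean of the n observations):

```latex
% Gaussian IID observations with a Gaussian prior: posterior mean and variance
\[
p(f\,|\,g_1,\dots,g_n) = \mathcal{N}\!\bigl(f;\; \mu_n,\, \sigma_n^{2}\bigr),
\qquad
\frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}},
\qquad
\mu_n = \sigma_n^{2}\!\left(\frac{n\,\bar{g}}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right).
\]
```

The posterior mean is indeed a convex combination of the sample mean and the prior mean, and the posterior precision is the sum of the two precisions (the "parallel" combination of the variances).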


Conjugate Priors (Gaussian IID observations, Gaussian prior)


Conjugate Priors (multivariate Gaussian: observation and prior)

• (g,f) jointly Gaussian distributed:

• Then

a)

b)

c)
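Items a) to c) are not written out above; the standard conditional-Gaussian result they presumably summarize is (a reconstruction, with μ the means and C the covariance blocks):

```latex
% Conditional distribution of f given g for jointly Gaussian (g, f)
\[
p(f\,|\,g) \;=\; \mathcal{N}\!\Bigl(f;\;
\mu_f + C_{fg}\,C_{gg}^{-1}\,(g-\mu_g),\;\;
C_{ff} - C_{fg}\,C_{gg}^{-1}\,C_{gf}\Bigr).
\]
```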


Conjugate Priors (multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Posterior


Conjugate Priors (multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Using the matrix inversion lemma

• The posterior mean is the solution of the following regularized LS problem, in which the regularization term can, e.g., penalize oscillatory solutions (see the sketch below)
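A minimal numerical sketch of this regularized-LS view (illustrative code with simulated data; the white-noise model, the first-difference smoothness penalty, and all variable names are assumptions made for the illustration):

```python
import numpy as np

# Linear Gaussian model g = A f + w, w ~ N(0, sigma^2 I), prior f ~ N(0, C).
# The posterior mean solves  min_f ||g - A f||^2 / sigma^2 + f^T C^{-1} f.
rng = np.random.default_rng(4)
n, p, sigma = 40, 20, 0.05
A = rng.normal(size=(n, p))
f_true = np.sin(np.linspace(0, 2 * np.pi, p))      # a smooth "signal"
g = A @ f_true + sigma * rng.normal(size=n)

# Prior chosen so that C^{-1} penalizes oscillatory solutions (first differences)
D = np.diff(np.eye(p), axis=0)
C_inv = 1.0 * (D.T @ D) + 1e-6 * np.eye(p)         # small ridge keeps it invertible

# Posterior mean / MAP estimate from the normal equations of the regularized LS problem
f_hat = np.linalg.solve(A.T @ A / sigma**2 + C_inv, A.T @ g / sigma**2)

print(np.round(f_hat - f_true, 2))                 # small residual error
```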


Improper Priors

• Assume that p(f) = k on a given domain

• Even if the domain of f is unbounded (so that p(f) is not a proper density), the posterior is well defined

• In a sense, improper priors account for a state of total ignorance. This raises no issues for the Bayesian framework, as long as the posterior is proper.


Bayes Estimators


Bayes estimators

Ingredients of Statistical Decision Theory:

• posterior distribution

conveys all knowledge about f, given the observation g

• loss function

measures the discrepancy between the true parameter f and its estimate

• a posteriori expected loss

• optimal Bayes estimator
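The corresponding standard definitions (a reconstruction; L denotes the loss function):

```latex
% A posteriori expected loss and the optimal Bayes estimator
\[
\rho(\hat{f}\,|\,g) \;=\; \int L(f, \hat{f})\; p(f\,|\,g)\; df ,
\qquad
\hat{f}_{\mathrm{Bayes}} \;=\; \arg\min_{\hat{f}}\; \rho(\hat{f}\,|\,g).
\]
```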


Bayesian framework

• Nuisance Parameter

Let the parameter be split into a parameter of interest and a

nuisance parameter

• The posterior risk depends only on the marginal posterior of the parameter of interest

• In a pure Bayesian framework, nuisance parameters are

integrated out


Bayes estimators: Maximum a posteriori probability (MAP)

• Zero-one (“0/1”) loss: zero inside a small ε-ball around the true value and one outside; the volume of the ε-ball appears in the expected loss

• Maximum a posteriori probability

A discrete domain leads to the

MAP estimator as well
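A sketch of the standard argument linking the 0/1 loss to the MAP estimator (a reconstruction; ε is the radius of the small ball):

```latex
% 0/1 loss and its Bayes estimator
\[
L_{\epsilon}(f,\hat{f}) \;=\;
\begin{cases} 0, & \|f-\hat{f}\| \le \epsilon\\[2pt] 1, & \text{otherwise}\end{cases}
\qquad\Longrightarrow\qquad
\hat{f} \;=\; \arg\min_{\hat{f}}
\Bigl[\,1 - \int_{\|f-\hat{f}\|\le\epsilon} p(f\,|\,g)\,df \Bigr]
\;\xrightarrow[\;\epsilon\to 0\;]{}\;
\arg\max_{f}\, p(f\,|\,g) \;=\; \hat{f}_{\mathrm{MAP}} .
\]
```

For small ε the integral is approximately the posterior density at the estimate times the volume of the ε-ball, which is why that volume appears in the expected loss.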


Bayes Estimators: Posterior Mean (PM)

• Quadratic loss:

Q is symmetric and positive definite

• Posterior mean

Only one term of the expanded expected loss depends on the estimate

• Valid for any symmetric positive definite Q. If Q is diagonal, the loss function is additive

• The posterior mean may be hard to compute
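The standard result behind this slide (a reconstruction):

```latex
% Quadratic loss and its Bayes estimator
\[
L(f,\hat{f}) \;=\; (f-\hat{f})^{T} Q\,(f-\hat{f}),\;\; Q \succ 0
\qquad\Longrightarrow\qquad
\hat{f} \;=\; \mathbb{E}\bigl[\,f \,\big|\, g\,\bigr],
\]
% the posterior mean, for any such Q.
```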


Bayes estimators: Additive loss

• Let

• Then, the minimization is decoupled

• Each component of the estimate minimizes the corresponding marginal

a posteriori expected loss


Bayes Estimators: Additive Loss

• Additive “0/1” loss:

each component of the estimate is the maximizer of the corresponding posterior marginal

• Additive quadratic loss:

The additive quadratic loss is a quadratic loss with Q=I. Therefore,

The corresponding Bayes estimator is the posterior mean


Example (Gaussian IID observations, Gaussian prior)

• Gaussian IID observations

• Gaussian prior

• The posterior distribution is Gaussian

as


Example (Gaussian observation, Laplacian prior)

MAP estimate

• The log-posterior is strictly concave


Example (Gaussian observation, Laplacian prior)

MAP estimate
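For a scalar observation with Gaussian noise and a Laplacian prior, the MAP estimate is the familiar soft-thresholding rule; the sketch below is an illustration (sigma and lam are assumed symbols, not the slides' notation):

```python
import numpy as np

# Scalar MAP estimate for g = f + w, w ~ N(0, sigma^2), prior p(f) ∝ exp(-lam*|f|).
# Minimizing (g - f)^2 / (2 sigma^2) + lam*|f| gives soft thresholding.
def map_laplacian(g, sigma=1.0, lam=1.0):
    thresh = lam * sigma**2
    return np.sign(g) * np.maximum(np.abs(g) - thresh, 0.0)

g = np.linspace(-5, 5, 11)
print(map_laplacian(g))   # zero inside [-1, 1], shrunk towards zero outside
```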


Example (Gaussian observation, Laplacian prior)

PM estimate

No closed-form expression exists;

one must resort to numerical procedures




Example (Multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Posterior

• The resulting posterior mean estimator is called the Wiener filter

• If all the eigenvalues of C approach infinity, the Wiener filter reduces to the Moore-Penrose pseudo (or generalized) inverse of A (see the sketch below)
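A sketch of the formulas behind these two bullets (a reconstruction, assuming white noise of variance σ², a zero-mean Gaussian prior with covariance C, and A with full column rank):

```latex
% Wiener filter and its flat-prior limit
\[
\hat{f} \;=\; \bigl(A^{T}A + \sigma^{2}\,C^{-1}\bigr)^{-1} A^{T}\, g
\;\;\xrightarrow[\;C^{-1}\to 0\;]{}\;\;
\bigl(A^{T}A\bigr)^{-1} A^{T}\, g \;=\; A^{+}\, g .
\]
```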