Transcript of Professor William Greene, Stern School of Business, IOMS Department and Department of Economics...
- Slide 1
- Professor William Greene, Stern School of Business, IOMS Department and Department of Economics. Statistical Inference and Regression Analysis: Stat-GB.3302.30, Stat-UB.0015.01
- Slide 2
- 2/98 Part 3 Estimation Theory. Immediate Reaction to the WHR (World Health Report) Health System Performance Report, New York Times, June 21, 2000
- Slide 3
- 3/98 Part 3 Estimation Theory. A Model of the Best a Country Could Do vs. What It Actually Does
- Slide 4
- 4/98 Part 3 Estimation Theory. The following was taken from http://www.msnbc.msn.com/id/27339545/ (an msnbc.com guide to presidential polls: why results, samples and methodology vary from survey to survey). WASHINGTON - A poll is a small sample of some larger number, an estimate of something about that larger number. For instance, what percentage of people report that they will cast their ballots for a particular candidate in an election? A sample reflects the larger number from which it is drawn. Let's say you had a perfectly mixed barrel of 1,000 tennis balls, of which 700 are white and 300 orange. You do your sample by scooping up just 50 of those tennis balls. If your barrel was perfectly mixed, you wouldn't need to count all 1,000 tennis balls; your sample would tell you that 30 percent of the balls were orange. (A simulation of this thought experiment appears below.)
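A minimal simulation of the tennis-ball thought experiment (my own sketch, not from the slides; the population counts match the example above):

    import random

    random.seed(1)
    barrel = ["white"] * 700 + ["orange"] * 300   # the perfectly mixed barrel

    estimates = []
    for _ in range(1000):                         # repeat the poll many times
        scoop = random.sample(barrel, 50)         # scoop up 50 balls
        estimates.append(scoop.count("orange") / 50)

    print(sum(estimates) / len(estimates))        # averages out near the true 0.30

Any single scoop can miss 0.30 by a fair margin; averaging many scoops shows the estimator is centered on the truth.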
- Slide 5
- 5/98 Part 3 Estimation Theory. Use random samples and basic descriptive statistics. What is the breach rate in a pool of tens of thousands of mortgages? (Breach = an improperly underwritten, serviced, or otherwise faulty mortgage.)
- Slide 6
- 6/98 Part 3 Estimation Theory The forensic analysis was an
examination of statistics from a random sample of 1,500 loans.
- Slide 7
- Part 3 Estimation Theory
- Slide 8
- 8/98 Part 3 Estimation Theory. Estimation. Nonparametric population features: mean (income); correlation (disease incidence and smoking); ratio (income per household member); proportion (proportion of ASCAP music played that is produced by Dave Matthews); distribution (histogram and density estimation). Parameters: fitting distributions (mean and variance of the lognormal distribution of income); parametric models of populations (the relationship of loan rates to attributes of minorities and others in the Bank of America settlement on mortgage bias).
- Slide 9
- 9/98 Part 3 Estimation Theory. Measurements as Observations. [Diagram: population; measurement theory; characteristics; behavior patterns; choices.] The theory argues that there are meaningful quantities to be statistically analyzed.
- Slide 10
- 10/98 Part 3 Estimation Theory. Application: Health and Income. German Health Care Usage Data, 7,293 households, observed 1984-1995. Data downloaded from the Journal of Applied Econometrics Archive. Some variables in the file are: DOCVIS = number of visits to the doctor in the observation period; HOSPVIS = number of visits to a hospital in the observation period; HHNINC = household nominal monthly net income in German marks / 10,000 (4 observations with income = 0 were dropped); HHKIDS = 1 if children under age 16 in the household, 0 otherwise; EDUC = years of schooling; AGE = age in years; PUBLIC = decision to buy public health insurance; HSAT = self-assessed health status (0, 1, ..., 10).
- Slide 11
- 11/98 Part 3 Estimation Theory Observed Data
- Slide 12
- 12/98 Part 3 Estimation Theory Inference about Population
Population Measurement Characteristics Behavior Patterns
Choices
- Slide 13
- 13/98 Part 3 Estimation Theory. Classical Inference. [Diagram: population; measurement; characteristics; behavior patterns; choices; sample.] Imprecise inference about the entire population: sampling theory and asymptotics. The population is all 40 million German households (or all households in the entire world); the sample is the 7,293 German households in 1984-1995.
- Slide 14
- 14/98 Part 3 Estimation Theory. Bayesian Inference. [Diagram: population; measurement; characteristics; behavior patterns; choices; sample.] Sharp, exact inference about only the sample; the posterior density is posterior to the data.
- Slide 15
- 15/98 Part 3 Estimation Theory. Estimation of Population Features. Estimators and estimates: an estimator is a strategy for the use of the data; an estimate is the outcome of that strategy. Sampling distribution: the qualities of the estimator; uncertainty due to random sampling.
- Slide 16
- 16/98 Part 3 Estimation Theory. Estimation. Point estimator: provides a single estimate of the feature in question, based on prior and sample information. Interval estimator: provides a range of values that incorporates both the point estimator and the uncertainty about the ability of the point estimator to find the population feature exactly.
- Slide 17
- 17/98 Part 3 Estimation Theory. Repeated Sampling - A Sampling Distribution. The true mean is 500. Sample means vary around 500, some quite far off. The sample mean has a sampling mean and a sampling variance; it also has a probability distribution, which looks like a normal distribution. This is a histogram of 1,000 means of samples of 20 observations from Normal[500, 100^2]. (A simulation of the experiment appears below.)
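A quick sketch (my own illustration, not from the slides) reproducing the repeated-sampling experiment:

    import random

    random.seed(1)
    means = []
    for _ in range(1000):                              # 1,000 repeated samples
        sample = [random.gauss(500, 100) for _ in range(20)]
        means.append(sum(sample) / 20)                 # one sample mean

    grand = sum(means) / len(means)                    # sampling mean, close to 500
    var = sum((m - grand) ** 2 for m in means) / (len(means) - 1)
    print(grand, var)

The sampling variance of the mean is the population variance divided by the sample size, here 100^2/20 = 500.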
- Slide 18
- 18/98 Part 3 Estimation Theory. Application: Credit Modeling. 1992 American Express analysis of: the application process (acceptance or rejection; X = 0 (reject) or 1 (accept)); cardholder behavior (loan default, D = 0 or 1; average monthly expenditure, E = $/month; general credit usage/behavior, Y = number of charges). 13,444 applications in November, 1992.
- Slide 19
- 19/98 Part 3 Estimation Theory 0.7809 is the true proportion in
the population of 13,444 we are sampling from.
- Slide 20
- 20/98 Part 3 Estimation Theory. Estimation Concepts. Random sampling: finite populations; i.i.d. samples from an infinite population. Information: prior information; sample information.
- Slide 21
- 21/98 Part 3 Estimation Theory Properties of Estimators
- Slide 22
- 22/98 Part 3 Estimation Theory Unbiasedness The sample mean of
the 100 sample estimates is 0.7844. The population mean (true
proportion) is 0.7809.
- Slide 23
- 23/98 Part 3 Estimation Theory. Consistency. [Figure: sampling distributions of the sample proportion for N = 144, N = 1024, and N = 4900, plotted over the range 0.70 to 0.88.]
- Slide 24
- 24/98 Part 3 Estimation Theory. Competing Estimators of a Parameter. Bank costs are normally distributed with mean μ. Which is a better estimator of μ: the mean (11.46) or the median (11.27)?
- Slide 25
- 25/98 Part 3 Estimation Theory. Interval estimates of the acceptance rate, based on the 100 samples of 144 observations.
- Slide 26
- 26/98 Part 3 Estimation Theory. Methods of Estimation. Information about the source population. Approaches: method of moments; maximum likelihood; Bayesian.
- Slide 27
- 27/98 Part 3 Estimation Theory The Method of Moments
- Slide 28
- 28/98 Part 3 Estimation Theory. Estimating a Parameter. Mean of a Poisson: p(y) = exp(-λ) λ^y / y!, y = 0, 1, ...; λ > 0. E[y] = λ, and E[(1/N) Σ_i y_i] = λ, so (1/N) Σ_i y_i is the estimator of λ. Mean of an exponential: f(y) = λ exp(-λy), y > 0; λ > 0. E[y] = 1/λ, and E[(1/N) Σ_i y_i] = 1/λ, so 1/{(1/N) Σ_i y_i} is the estimator of λ. (A numerical sketch follows.)
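A small numerical sketch of these two method-of-moments estimators (my own illustration, with simulated data): the sample mean estimates the Poisson λ, and its reciprocal estimates the exponential λ.

    import numpy as np

    rng = np.random.default_rng(1)

    y_pois = rng.poisson(lam=3.0, size=5000)
    lam_hat_pois = y_pois.mean()                     # MoM: E[y] = lambda

    y_exp = rng.exponential(scale=1/0.5, size=5000)  # true lambda = 0.5
    lam_hat_exp = 1.0 / y_exp.mean()                 # MoM: E[y] = 1/lambda

    print(lam_hat_pois, lam_hat_exp)                 # near 3.0 and 0.5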
- Slide 29
- 29/98 Part 3 Estimation Theory Mean and Variance of a Normal
Distribution
- Slide 30
- 30/98 Part 3 Estimation Theory. Proportion for Bernoulli. In the AmEx data, the true population acceptance rate is θ = 0.7809. Y = 1 if the application is accepted, 0 if not. E[y] = E[(1/N) Σ_i y_i] = p(accept) = θ, so the sample proportion is the estimator.
- Slide 31
- 31/98 Part 3 Estimation Theory Gamma Distribution
- Slide 32
- 32/98 Part 3 Estimation Theory. Method of Moments. Ψ(P) = Γ'(P)/Γ(P) = d log Γ(P)/dP (the digamma function).
- Slide 33
- 33/98 Part 3 Estimation Theory
- Slide 34
- 34/98 Part 3 Estimation Theory. Estimate One Parameter. Assume λ is known to be 0.1; estimate P. E[y] = P/λ = P/0.1 = 10P. m_1 = mean of y = 31.278, so the estimate of P is 31.278/10 = 3.1278. One equation in one unknown.
- Slide 35
- 35/98 Part 3 Estimation Theory Application
- Slide 36
- 36/98 Part 3 Estimation Theory. Method of Moments Solutions (a Python version follows).
create ; y1=y ; y2=log(y) ; ysq=y*y $
calc ; m1=xbr(y1) ; mlog=xbr(y2) ; m2=xbr(ysq) $
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (mlog - (psi(p)-log(l)))^2 $
Result: P = 2.41074, L = .07707
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (m2 - p*(p+1)/l^2)^2 $
Result: P = 2.06182, L = .06589
- Slide 37
- 37/98 Part 3 Estimation Theory. Properties of the MoM estimator. Unbiased? Sometimes, e.g., the normal, Bernoulli and Poisson means. Consistent? Yes, by virtue of the Slutsky theorem; this assumes the parameters can vary continuously and the moment functions are continuous and smooth. Efficient? Maybe; that remains to be seen. (Which pair of moments should be used for the gamma distribution?) Sampling distribution? Generally normal, by virtue of the Lindeberg-Levy central limit theorem and the Slutsky theorem.
- Slide 38
- 38/98 Part 3 Estimation Theory. Estimating the Sampling Variance. Exact sampling results: the Poisson mean, the normal mean and variance. Approximation based on linearization. Bootstrapping: discussed later, with the maximum likelihood estimator.
- Slide 39
- 39/98 Part 3 Estimation Theory. Exact Variance of the MoM. To estimate a normal or Poisson mean, the estimator is the sample mean = (1/N) Σ_i Y_i. The exact variance of the sample mean is 1/N times the population variance.
- Slide 40
- 40/98 Part 3 Estimation Theory Linearization Approach - 1 Parameter
- Slide 41
- 41/98 Part 3 Estimation Theory Linearization Approach - 1 Parameter
- Slide 42
- 42/98 Part 3 Estimation Theory Linearization Approach - General
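The formulas on these three slides did not survive the transcript. A standard statement of the linearization (delta-method) result they describe, under the usual assumptions: if the estimator is a smooth function of sample moments, θ̂ = g(m̄), with √N(m̄ - μ) converging in distribution to N(0, Σ), then

\[ \sqrt{N}\,(\hat\theta - \theta) \;\xrightarrow{d}\; N\left(0,\; G\,\Sigma\,G'\right), \qquad G = \frac{\partial g(\mu)}{\partial \mu'}, \]

so the estimated variance is Var(θ̂) ≈ (1/N) G Σ̂ G', with G evaluated at the sample moments.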
- Slide 43
- 43/98 Part 3 Estimation Theory. Exercise: Gamma Parameters. m_1 = (1/N) Σ y_i estimates P/λ; m_2 = (1/N) Σ y_i^2 estimates P(P+1)/λ^2. 1. What is the Jacobian (the matrix of derivatives)? 2. How do we compute the variance of m_1, the variance of m_2, and the covariance of m_1 and m_2? (The variance of m_1 is 1/N times the variance of y; the variance of m_2 is 1/N times the variance of y^2; the covariance is 1/N times the covariance of y and y^2.) (A worked version appears below.)
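A hypothetical worked version of the exercise in Python (simulated data; the closed forms for P and λ in terms of m_1 and m_2 follow from solving the two moment equations):

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.gamma(shape=2.0, scale=1/0.06, size=1000)
    N = len(y)

    m1, m2 = y.mean(), (y**2).mean()
    d = m2 - m1**2                        # estimates Var(y) = P/lambda^2
    lam = m1 / d                          # from m1 = P/lam, m2 = P(P+1)/lam^2
    P = lam * m1

    S = np.cov(np.vstack([y, y**2])) / N  # covariance matrix of (m1, m2)
    J = np.array([[2*m1*m2/d**2,      -m1**2/d**2],   # d(P, lam)/d(m1, m2)
                  [(m2 + m1**2)/d**2, -m1/d**2]])

    V = J @ S @ J.T                       # delta-method covariance of (P, lam)
    print(P, lam, np.sqrt(np.diag(V)))    # estimates and asymptotic std. errors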
- Slide 44
- 44/98 Part 3 Estimation Theory Sufficient Statistics
- Slide 45
- 45/98 Part 3 Estimation Theory Sufficient Statistic
- Slide 46
- 46/98 Part 3 Estimation Theory Sufficient Statistic
- Slide 47
- 47/98 Part 3 Estimation Theory Sufficient Statistics
- Slide 48
- 48/98 Part 3 Estimation Theory Gamma Density
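The density on slide 48 did not survive the transcript; in the parameterization used elsewhere in these slides (E[y] = P/λ), the gamma density is

\[ f(y \mid P, \lambda) = \frac{\lambda^{P}\, y^{P-1}\, e^{-\lambda y}}{\Gamma(P)}, \qquad y > 0,\; P > 0,\; \lambda > 0. \]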
- Slide 49
- 49/98 Part 3 Estimation Theory. Rao-Blackwell Theorem. The mean squared error of an estimator based on sufficient statistics is smaller than that of one not based on sufficient statistics. We deal in consistent estimators, so a large-sample (approximate) version of the theorem is that estimators based on sufficient statistics are more efficient than those that are not.
- Slide 50
- 50/98 Part 3 Estimation Theory. Maximum Likelihood Estimation. The criterion is comparable to the method of moments. Several virtues: broadly, it uses all the sample and nonsample information available, and it is efficient (better than MoM in many cases).
- Slide 51
- 51/98 Part 3 Estimation Theory. Setting Up the MLE. The distribution of the observed random variable is written as a function of the parameter(s) to be estimated: p(y_i | θ) = probability density of the data given the parameters; L(θ | y_i) = likelihood of the parameters given the data. The likelihood function is constructed from the density. Construction: the joint probability density function of the observed sample of data, generally the product of the individual densities when the data are a random sample. The estimator is chosen to maximize the likelihood of the data (essentially the probability of observing the sample in hand).
- Slide 52
- 52/98 Part 3 Estimation Theory. Regularity Conditions. Why? Regular MLEs have known, good properties; nonregular estimators usually do not have known properties (good or bad). What they are: 1. log f(.) has three continuous derivatives with respect to the parameters. 2. The conditions needed to obtain expectations of derivatives are met (e.g., the range of the variable is not a function of the parameters). 3. The third derivative has finite expectation. What they mean: moment conditions and convergence. We need to obtain expectations of derivatives; we need to be able to truncate Taylor series; we will use central limit theorems. The MLE exists for nonregular densities (see the text), but it has questionable statistical properties.
- Slide 53
- 53/98 Part 3 Estimation Theory. Regular Exponential Density. Exponential density: f(y_i | θ) = (1/θ) exp(-y_i/θ); θ = average time until failure of light bulbs, y_i = observed life until failure. Regularity: (1) The range of y is (0, ∞), free of θ. (2) log f(y_i | θ) = -log θ - y_i/θ; ∂ log f(y_i | θ)/∂θ = -1/θ + y_i/θ^2; since E[y_i] = θ, E[∂ log f/∂θ] = 0. (3) ∂^2 log f(y_i | θ)/∂θ^2 = 1/θ^2 - 2y_i/θ^3 has finite expectation, -1/θ^2. (4) ∂^3 log f(y_i | θ)/∂θ^3 = -2/θ^3 + 6y_i/θ^4 has finite expectation, 4/θ^3. (5) All derivatives are continuous functions of θ.
- Slide 54
- 54/98 Part 3 Estimation Theory. Likelihood Function. L(θ) = Π_i f(y_i | θ). The MLE is the value of θ that maximizes the likelihood function. It is generally easier to maximize the log of L; the same θ maximizes log L. In random sampling, log L = Σ_i log f(y_i | θ).
- Slide 55
- 55/98 Part 3 Estimation Theory. Poisson Likelihood. (log and ln both mean the natural log throughout this course.)
- Slide 56
- 56/98 Part 3 Estimation Theory. The MLE. The log-likelihood function: log L(θ | data) = Σ_i log f(y_i | θ). The likelihood equation(s) set the first derivative(s) to zero: the first derivatives of log L equal zero at the MLE, ∂[Σ_i log f(y_i | θ)]/∂θ = 0 at θ = θ_MLE. Interchanging summation and differentiation, Σ_i [∂ log f(y_i | θ)/∂θ] = 0 at θ = θ_MLE.
- Slide 57
- 57/98 Part 3 Estimation Theory. Applications: Bernoulli, exponential, Poisson, normal, gamma.
- Slide 58
- 58/98 Part 3 Estimation Theory Bernoulli
- Slide 59
- 59/98 Part 3 Estimation Theory. Exponential. Estimating the average time until failure, θ, of light bulbs; y_i = observed life until failure. f(y_i | θ) = (1/θ) exp(-y_i/θ). L(θ) = Π_i f(y_i | θ) = θ^(-N) exp(-Σ_i y_i/θ). log L(θ) = -N log θ - Σ_i y_i/θ. Likelihood equation: ∂ log L(θ)/∂θ = -N/θ + Σ_i y_i/θ^2 = 0. Solution (multiply both sides of the equation by θ^2): θ̂ = Σ_i y_i/N (the sample average estimates the population average). (A numerical check appears below.)
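A short numerical check of this result (my own sketch, with simulated failure times): maximizing the exponential log-likelihood numerically returns the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    y = rng.exponential(scale=2.0, size=500)        # true theta = 2.0

    def negloglik(theta):                           # -log L(theta)
        return len(y) * np.log(theta) + y.sum() / theta

    res = minimize_scalar(negloglik, bounds=(0.01, 10.0), method="bounded")
    print(res.x, y.mean())                          # numerical MLE = sample average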
- Slide 60
- 60/98 Part 3 Estimation Theory Poisson Distribution
- Slide 61
- 61/98 Part 3 Estimation Theory Normal Distribution
- Slide 62
- 62/98 Part 3 Estimation Theory. Gamma Distribution. Ψ(P) = Γ'(P)/Γ(P) = d log Γ(P)/dP.
- Slide 63
- 63/98 Part 3 Estimation Theory. Gamma Application. Gamma (loglinear) regression model, dependent variable Y; log likelihood function = -85.37567.

Parameters in the conditional mean function:
LAMBDA: coefficient .07707***, standard error .02544, z = 3.03, Prob. |z|>Z* = .0024, 95% confidence interval [.02722, .12692].
Scale parameter for the gamma model:
P_scale: coefficient 2.41074***, standard error .71584, z = 3.37, Prob. = .0008, 95% confidence interval [1.00757, 3.81363].

This is the same solution as the method of moments using m1 and mlog:
create ; y1=y ; y2=log(y) $
calc ; m1=xbr(y1) ; mlog=xbr(y2) $
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (mlog - (psi(p)-log(l)))^2 $
Result: P = 2.41074, L = .07707
- Slide 64
- 64/98 Part 3 Estimation Theory. Properties of the MLE: regularity; finite-sample vs. asymptotic properties; properties of the estimator; information used in estimation.
- Slide 65
- 65/98 Part 3 Estimation Theory. Properties of the MLE. Sometimes unbiased, but usually not. Always consistent (under regularity). Normally distributed in large samples. Efficient. Invariant. Sufficient (uses sufficient statistics when they exist).
- Slide 66
- 66/98 Part 3 Estimation Theory. Unbiasedness. Usually only when estimating a parameter that is the mean of the random variable: the normal mean, the Poisson mean, the Bernoulli probability (which is the mean). The MLE does not make degrees-of-freedom corrections. Almost no other cases.
- Slide 67
- 67/98 Part 3 Estimation Theory. Consistency. Under regularity, the MLE is consistent. Without regularity it may be consistent, but that usually cannot be proved. In almost all cases the MLE is mean-square consistent: its expectation converges to the parameter and its variance converges to zero. (The proof is sketched in the Rice text, pp. 275-276.)
- Slide 68
- 68/98 Part 3 Estimation Theory Large Sample Distribution
- Slide 69
- 69/98 Part 3 Estimation Theory The Information Equality
- Slide 70
- 70/98 Part 3 Estimation Theory Deduce The Variance of MLE
- Slide 71
- 71/98 Part 3 Estimation Theory Computing the Variance of the
MLE
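The derivations on slides 69-71 did not survive the transcript. The standard results they refer to are the information equality and the asymptotic variance it implies:

\[ \mathrm{E}\left[\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'}\right] = -\,\mathrm{E}\left[\left(\frac{\partial \log L(\theta)}{\partial \theta}\right)\left(\frac{\partial \log L(\theta)}{\partial \theta}\right)'\right], \qquad \mathrm{Asy.Var}\left(\hat\theta_{MLE}\right) = \left\{-\mathrm{E}\left[\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'}\right]\right\}^{-1}. \]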
- Slide 72
- 72/98 Part 3 Estimation Theory. Application: GSOEP Income. Descriptive statistics for HHNINC: mean = .355564, std. dev. = .166561, minimum = .030000, maximum = 2.0, cases = 2698, missing = 0.
- Slide 73
- 73/98 Part 3 Estimation Theory Variance of MLE
- Slide 74
- 74/98 Part 3 Estimation Theory. Bootstrapping. Given the sample, i = 1, ..., N: sample N observations with replacement (some get picked more than once, some do not get picked); recompute the estimate of θ; repeat R times to obtain R new estimates of θ; estimate the variance with the sample variance of the R new estimates. (A sketch appears below.)
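A minimal bootstrap sketch in Python (my own illustration; simulated data stand in for the income variable):

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.gamma(shape=2.0, scale=0.18, size=2698)     # stand-in for HHNINC

    R, N = 1000, len(y)
    thetas = np.empty(R)
    for r in range(R):
        resample = rng.choice(y, size=N, replace=True)  # sample with replacement
        thetas[r] = resample.mean()                     # recompute the estimate

    print(thetas.var(ddof=1))    # bootstrap estimate of the variance of the mean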
- Slide 75
- 75/98 Part 3 Estimation Theory. Bootstrap Results. Estimated variance = .00311^2.
- Slide 76
- 76/98 Part 3 Estimation Theory. Sufficiency. If sufficient statistics exist, the MLE will be a function of them; therefore, the MLE satisfies the Rao-Blackwell theorem (in large samples).
- Slide 77
- 77/98 Part 3 Estimation Theory. Efficiency: the Cramér-Rao Lower Bound. The variance of a consistent, asymptotically normally distributed estimator is ≥ -1/{N E[H_i(θ)]}, where H_i(θ) is the second derivative of the log density for observation i. The MLE achieves the C-R lower bound, so it is efficient. Implication: for normal sampling, the mean is a better estimator of the population mean than the median.
- Slide 78
- 78/98 Part 3 Estimation Theory Invariance
- Slide 79
- 79/98 Part 3 Estimation Theory. Bayesian Estimation. Philosophical underpinnings: how to combine the information contained in the sample with information from outside the sample.
- Slide 80
- 80/98 Part 3 Estimation Theory. Estimation: Assembling Information. Prior information = out-of-sample information, literally prior or outside information. Sample information is embodied in the likelihood. The result of the analysis: the posterior belief, a blend of the prior and the likelihood.
- Slide 81
- 81/98 Part 3 Estimation Theory. Using Conditional Probabilities: Bayes Theorem. Typical application: we know P(B|A); we want P(A|B). In drug testing: we know P(find evidence of drug use | usage) < 1; we need P(usage | find evidence of drug use). The problem is false positives: P(find evidence of drug use | no usage) > 0, which implies that P(usage | find evidence of drug use) < 1.
- Slide 82
- 82/98 Part 3 Estimation Theory Bayes Theorem
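The formula on slide 82 did not survive the transcript; Bayes' theorem, as used in the examples that follow, is

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c})}. \]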
- Slide 83
- 83/98 Part 3 Estimation Theory. Disease Testing. Notation: + = the test indicates disease; - = the test indicates no disease; D = presence of disease; N = absence of disease. Known data: P(Disease) = P(D) = .005, fairly rare (incidence). P(test correctly indicates disease) = P(+|D) = .98 (sensitivity; correct detection of the disease). P(test correctly indicates absence) = P(-|N) = .95 (specificity; correct failure to detect the disease). Objective: deduce the probabilities P(D|+) (the probability disease really is present given a positive test) and P(N|-) (the probability disease really is absent given a negative test). Note: P(D|+) is the probability that a patient actually has the disease when the test says they do.
- Slide 84
- 84/98 Part 3 Estimation Theory. More Information. Deduce: since P(+|D) = .98, we know P(-|D) = .02, because P(-|D) + P(+|D) = 1. [P(-|D) is the probability of a false negative.] Deduce: since P(-|N) = .95, we know P(+|N) = .05, because P(-|N) + P(+|N) = 1. [P(+|N) is the probability of a false positive.] Deduce: since P(D) = .005, we know P(N) = .995, because P(D) + P(N) = 1.
- Slide 85
- 85/98 Part 3 Estimation Theory Now, Use Bayes Theorem
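The computation on slide 85 is lost in the transcript, but it follows directly from the numbers above:

\[ P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid N)\,P(N)} = \frac{.98 \times .005}{.98 \times .005 + .05 \times .995} = \frac{.0049}{.05465} \approx .0897. \]

Despite the positive test, the probability that the disease is actually present is only about 9 percent, because the disease is rare and false positives dominate.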
- Slide 86
- 86/98 Part 3 Estimation Theory. Bayesian Investigation. There are no fixed parameters; θ is a random variable. The data are realizations of random variables, so there is a marginal distribution p(data). The parameters are part of the random state of nature: p(θ) = the distribution of θ independently of (prior to) the data. The investigation combines sample information with prior information; the outcome is a revision of the prior based on the observed information (the data).
- Slide 87
- 87/98 Part 3 Estimation Theory
- Slide 88
- 88/98 Part 3 Estimation Theory. Symmetrical Treatment. The likelihood is p(data | θ). The prior distribution summarizes the nonsample information about θ in p(θ). The joint distribution is p(data, θ) = p(data | θ) p(θ) = likelihood × prior. Use Bayes theorem to get p(θ | data), the posterior distribution.
- Slide 89
- 89/98 Part 3 Estimation Theory The Posterior Distribution
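The expression on slide 89 did not survive the transcript; the posterior distribution built from the pieces on the previous slide is

\[ p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{\int p(\text{data} \mid \theta)\, p(\theta)\, d\theta}. \]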
- Slide 90
- 90/98 Part 3 Estimation Theory. Priors. Where do they come from? What does the prior contain? Informative priors: real prior information. Noninformative priors (a source of mathematical complications): diffuse, uniform, or normal with huge variance; improper priors. Conjugate priors.
- Slide 91
- 91/98 Part 3 Estimation Theory. Application. Consider estimation of the probability θ that a production process will produce a defective product. In case 1, suppose the sampling design is to choose N = 25 items from the production line and count the number of defectives. If the probability that any item is defective is a constant θ between zero and one, then the likelihood for the sample of data is L(θ | data) = θ^D (1 - θ)^(25-D), where D is the number of defectives, say, 8. The maximum likelihood estimator of θ will be q = D/25 = 0.32, and the asymptotic variance of the maximum likelihood estimator is estimated by q(1 - q)/25 = 0.008704.
- Slide 92
- 92/98 Part 3 Estimation Theory Application: Posterior
Density
- Slide 93
- 93/98 Part 3 Estimation Theory Posterior Moments
- Slide 94
- 94/98 Part 3 Estimation Theory Mixing Prior and Sample
Information
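The algebra on slides 92-94 is lost in the transcript. For this binomial application, the standard result those slides refer to (assuming a Beta(a, b) prior for θ) is

\[ p(\theta \mid D) \propto \theta^{D+a-1}(1-\theta)^{N-D+b-1}, \qquad E[\theta \mid D] = \frac{a+D}{a+b+N} = \frac{a+b}{a+b+N}\cdot\frac{a}{a+b} + \frac{N}{a+b+N}\cdot\frac{D}{N}, \]

so the posterior mean is a weighted average of the prior mean and the MLE, with the data weight growing with N. With a uniform prior (a = b = 1), D = 8 and N = 25, the posterior is Beta(9, 18) with mean 9/27 = .333333, matching the next slide.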
- Slide 95
- 95/98 Part 3 Estimation Theory. Modern Bayesian Analysis. Bayesian estimate of θ based on 5,000 posterior draws (the exact posterior mean is .333333): mean = .334017; standard deviation = .086336; posterior variance = .007936; sample variance = .007454; skewness = .248077; excess kurtosis = -.161478; minimum = .066214; maximum = .653625; .025 percentile = .177090; .975 percentile = .510028. (A sketch that reproduces these figures appears below.)
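A sketch that reproduces these figures by simulation (assuming, as the numbers imply, a uniform prior, so the posterior is Beta(9, 18)):

    import numpy as np

    rng = np.random.default_rng(1)
    D, N = 8, 25
    draws = rng.beta(D + 1, N - D + 1, size=5000)   # Beta(9, 18) posterior draws

    print(draws.mean())                      # near the exact posterior mean 9/27
    print(draws.var(ddof=1))                 # near the exact posterior variance .007936
    print(np.percentile(draws, [2.5, 97.5])) # the .025 and .975 percentiles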
- Slide 96
- 96/98 Part 3 Estimation Theory. Modern Bayesian Analysis: multiple-parameter settings. Deriving the exact form of the expectations and variances for p(θ_1, θ_2, ..., θ_K | data) is hopelessly complicated, even if the density is tractable. Strategy: sample joint observations (θ_1, θ_2, ..., θ_K) from the posterior population and use the marginal means, variances, quantiles, etc. How to sample the joint observations? (Still hopelessly complicated.)
- Slide 97
- 97/98 Part 3 Estimation Theory. Magic: The Gibbs Sampler. Objective: sample joint observations on θ_1, θ_2, ..., θ_K from p(θ_1, θ_2, ..., θ_K | data). (Let K = 3.) Strategy: derive the full conditionals p(θ_1 | θ_2, θ_3, data), p(θ_2 | θ_1, θ_3, data), and p(θ_3 | θ_1, θ_2, data). Gibbs cycles produce joint observations: 0. Start θ_1, θ_2, θ_3 at some reasonable values. 1. Sample a draw from p(θ_1 | θ_2, θ_3, data) using the draws of θ_2, θ_3 in hand. 2. Sample a draw from p(θ_2 | θ_1, θ_3, data) using the draw at step 1 for θ_1. 3. Sample a draw from p(θ_3 | θ_1, θ_2, data) using the draws at steps 1 and 2. 4. Return to step 1. After a burn-in period (a few thousand cycles), start collecting the draws. The set of draws ultimately gives a sample from the joint distribution. (A toy implementation follows.)
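A toy Gibbs sampler (my own illustration, with K = 2 rather than the slide's K = 3): for a bivariate normal with correlation ρ, each full conditional is a univariate normal, so both steps of the cycle are easy to sample.

    import numpy as np

    rng = np.random.default_rng(1)
    rho = 0.8                        # target: standard bivariate normal, corr rho
    n_keep, burn_in = 10000, 2000

    t1, t2 = 0.0, 0.0                # step 0: start at reasonable values
    draws = []
    for i in range(n_keep + burn_in):
        t1 = rng.normal(rho * t2, np.sqrt(1 - rho**2))  # theta1 | theta2
        t2 = rng.normal(rho * t1, np.sqrt(1 - rho**2))  # theta2 | theta1
        if i >= burn_in:             # collect only after the burn-in
            draws.append((t1, t2))

    draws = np.array(draws)
    print(np.corrcoef(draws.T)[0, 1])   # close to rho = 0.8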
- Slide 98
- 98/98 Part 3 Estimation Theory. Methodological Issues. Priors: schizophrenia. Uninformative priors are disingenuous; informative priors are not objective. What about using existing information? The Bernstein-von Mises theorem and likelihood estimation: in large samples the likelihood dominates, and the posterior mean will be the same as the MLE.