Chapter 5 - Introduction to Probability and Statistics∗
Simona Helmsmueller†
These lecture notes are meant to be used by students entering the University of Mannheim
Master program in Economics. They constitute the basis for a pre-course in mathematics;
that is, they summarize elementary concepts with which all of our econ grad students must
be familiar. More advanced concepts will be introduced later on in the regular coursework.
A thorough knowledge of these basic notions will be assumed in later
coursework.
Although the wording is my own, the definitions of concepts and the ways to approach them
are strongly inspired by various sources, which are mentioned explicitly in the text or at the
end of the chapter.
Simona Helmsmueller
∗This version: 2017. †Center for Doctoral Studies in Economic and Social Sciences.
Contents
1 Introduction
2 Probability theory
2.1 Introduction
2.2 Random variables, distributions and their features
2.3 Joint variables, distributions and their features
3 Statistics
3.1 Introduction
3.2 Finite sample properties
3.3 Large sample properties
Preview
What you should take away from this chapter:
1. Introduction
• You should grasp the importance of stochastics and statistics in all areas of economic
science.
2. Probability theory
• You should know what a random variable, its distribution function and the corre-
sponding density are, and know the calculation rules that apply.
• You should be able to draw some pictures illustrating cdfs and their densities,
and be able to show how the picture changes with changing mean and variance.
• You should know the definition and properties of mean and variance.
• You should have a good intuition for the terms joint distribution, marginal dis-
tribution and conditional distribution.
• You should be able to check for independence of two variables, and be able to
calculate the covariance and correlation between the two.
3. Statistics
• You should understand that statistics is about samples and estimation of some
quantities of interest (in our case: parameters of the distribution function).
• You should know the difference between the terms population variance, variance
estimator, and variance of the estimator.
• You should know criteria for assessing the quality of an estimator.
• You should have a graphical and analytical understanding of the terms bias, vari-
ance and consistency of an estimator.
• You should have an intuition and a graphical understanding of the law of large
numbers and the central limit theorem, and be able to write both of them down
in your sleep!
1 Introduction
Modern economics is a study of uncertain behavior in response to uncertain events under
incomplete information. Understanding this uncertainty, or randomness, and the mathematical concepts
associated with it is therefore an integral prerequisite for any economist. I could easily have
filled another week of preparatory classes on stochastics and statistics. However, we do not
have the time for this, and I am therefore forced to reduce this to a very short introduction of
the key concepts. The following leans heavily on appendices B and C of Wooldridge's
Introductory Econometrics: A Modern Approach (2013). All graphs included in these lecture
notes are taken from the book. I strongly urge you to read the roughly 80 pages in the book.
They are a relatively easy read and a thorough understanding of the concepts laid out there
will save you a lot of future trouble. This is not only valid for the following econometric
classes but also for any micro and macro lecture. I doubt you will have any lecture where
no expected value pops up at least once. Making sure you understand the language of these
classes will allow you to focus on the plot!
Here are a couple of examples where probability theory or statistics come into play:
1. In a duopoly, two companies A and B set their prices at the same time. Whether or not
company A gets most (or all) of the customers depends not only on the price of that
company but also on the price the competitor offers. The price of company B on the
other hand depends on that company’s production cost and pricing strategy. While
company A might have some information about the production technology of company
B (e.g. by considering the historic prices or some insider knowledge), some uncertainty
remains. For its own pricing strategy, company A therefore considers different scenarios
and, as a rational agent, will decide for the one which promises the highest expected
revenues.
2. A state considering raising its unemployment benefits will also consider the impact
this has on the labor market. Will wages rise? Will the required higher tax rate to
finance the benefits lead to less consumption? To answer these questions, the state has
to form expectations about the behavior of firms and job-seekers.
3. A consumer considering whether to buy insurance needs to weigh the premium of the
policy against the expected loss incurred by the adverse event.
4. An economist wants to know whether some policy intervention, e.g. the introduction
of a mandatory health insurance scheme, really has an impact on the beneficiaries’
lives, e.g. on their health status. Because we cannot compare the health status of
the insured people with the health status of the same people but in a parallel universe
without the insurance, we must rely on estimation techniques by first finding a suitable
control group, e.g. the same population before the introduction of a health insurance,
or the population of a different state. Also, the impact is likely to be different for
different people, e.g. depending on their pre-insurance health status, their life style or
available health infrastructure. Instead of estimating an effect for each individual, we
consider means over sample treatment and control populations and use statistics to
infer causally about the impact of the insurance on the health status.
2 Probability theory
2.1 Introduction
Probability theory lays the foundation for statistical analysis by providing the formal, math-
ematical framework. If we were to introduce this framework rigorously, I would need to start
with an introduction to measure theory. This is a beautiful subject, with unexpected twists
and turns and if you ever feel that you want to dig deeper into higher mathematics, I highly
recommend you get a book on measure theory. Now, however, this preparatory course is
restricted in time and therefore, I will only be able to give you a very brief introduction
into probability theory, mainly consisting of definitions and examples, and leaving out the
general concepts and proofs.
Let me, however, try to give you some intuition on measures and their importance for
probability theory. In mathematics, a measure is nothing else than a function describing
some sort of magnitude of mathematical objects, here: sets. Analogously to our derivation
of the definition of vector spaces, there are some intuitive criteria which you would require
such a measure function to fulfill. For example, it would be reasonable to assume that the
measure function maps into the non-negative real numbers, and that if there is nothing inside the
considered object (whatever that may mean), then its measure is 0. You might also want to
ensure that an object which is fully contained within another object has a smaller measure.
There are other things to consider: what would you expect from the measure of the union
of two (disjoint) objects? Can sets that contain elements also have a zero measure? Can the
measure take the value infinity and under what circumstances could this be the case?1
To illustrate this further, let us consider some examples. If the mathematical objects
considered are closed intervals [a, b], a measure could be the length b−a and you might want
to require that this is the same for an open interval (a, b). As another example, if the sets
you consider are countable, then it is straightforward to let the measure function assign each
set the number of elements in it. These examples might sound trivial, but measure theory
extends these basic notions to more general cases. It can even be shown that there
exist subsets of the real numbers to which one cannot assign a measure with the desired
properties described above.
Why is this important to probability theory? Because it sets the frame for the formal
definition of what one calls probability in everyday language. This is in fact a highly critical,
1The questions asked here are similar to those used for introducing metrics and norms on vector spaces. The difference here is that we consider general sets without algebraic structure, i.e. addition and scalar multiplication might not be defined on the set's elements.
even philosophical question. We use terms like probability and likelihood many times each
day and we might feel that it is somehow associated with relative frequencies. So why bother
about a formal definition?
Consider the following question:
In the US, an individual has been described by a neighbor as follows: Steve
is very shy and withdrawn, invariably helpful but with very little interest
in people or in the world of reality. A meek and tidy soul, he has a need
for order and structure, and a passion for detail. Is Steve more likely to
be a librarian or a farmer?
As Kahneman and Tversky observe, the majority of respondents assign a higher likelihood
to Steve being a librarian. We do not take into consideration the relative frequencies: in
the US there are five times more farmers than librarians - and this ratio is even higher for
male farmers compared to male librarians. In the absence of a formal concept, we are liable
to be misled by unrelated information.
Now look at the following story known as the Linda problem by Kahneman and Tversky.
Answer quickly what comes to your mind!
Linda is thirty-one years old, single, outspoken, and very bright. She
majored in philosophy. As a student, she was deeply concerned with is-
sues of discrimination and social justice, and also participated in anti-
nuclear demonstrations...Please rank in order of likelihood various sce-
narios: Linda is
1. an elementary school teacher,
2. active in the feminist movement,
3. a bank teller,
4. an insurance salesperson, or
5. a bank teller also active in the feminist movement.
Independent of the other rankings, most respondents assign (5) a higher rank than (3).
Yet there are certainly more bank tellers than bank tellers who are also active in the feminist
movement, as the latter is a proper subset of the former!
These examples show that it is easy to go astray in a world with a lot of information if we
are asked to navigate through this solely with our intuition. Mathematical definitions can
illuminate patterns in this information and this might help us decide on which information
to take seriously and which to discard. This is why we need some probability theory to
successfully master statistics!2
2The examples also quite clearly highlight why the rational agent models used in classical economics are
Nevertheless, a typical procedure in econometric classes is to skip the definition of a prob-
ability measure and start with the definition of probability distributions, where probability
is again taken to be a term that needs no further definition. Let me at this stage at least
provide you with the formal definition of a general probability measure. It is more general in
that it incorporates the discrete and the continuous case as well as any mixed case.
Definition 2.1. (Probability measure)
Let S be a set and P(S) its power set³. A function P from P(S) to [0, 1] is called a
probability measure if
1. Non-negativity: for all A ∈ P(S): P(A) ≥ 0.
2. Certain event: P(S) = 1.
3. Countable additivity: for all countable collections {S_i}_{i=1}^∞ of pairwise disjoint sets
in P(S): P(∪_{k=1}^∞ S_k) = Σ_{k=1}^∞ P(S_k).
You should confirm that the intuitive meaning of probability as limit of relative frequen-
cies fulfills this definition. A probability measure defined as such has a useful additional
property: If A ⊂ B, then P (A) ≤ P (B). As an exercise, you should prove this using the
above axioms! This property is what we would have needed to avoid the pitfall in the Linda
problem.
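To make the axioms concrete, here is a minimal sketch (my own illustration, not from the text) of a probability measure on a finite sample space: the uniform measure on the faces of a fair die, with the axioms and the derived monotonicity property checked directly.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # sample space: the six faces of a fair die

def P(event):
    """Uniform probability measure: P(A) = |A| / |S|."""
    assert event <= S, "events must be subsets of the sample space"
    return Fraction(len(event), len(S))

# The axioms of Definition 2.1 (finite additivity suffices here):
assert P(frozenset()) == 0                  # nothing inside -> measure 0
assert P(S) == 1                            # certain event
A, B = frozenset({2, 4}), frozenset({2, 4, 6})
assert P(A | B) + P(A & B) == P(A) + P(B)   # additivity
assert A <= B and P(A) <= P(B)              # monotonicity: A subset of B implies P(A) <= P(B)
print(P(B))  # prints 1/2
```

The last assertion is exactly the property that would have saved us in the Linda problem: an event cannot be more probable than an event containing it.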
2.2 Random variables, distributions and their features
Having in mind that there is a formal definition of a probability measure which includes but
generalizes our intuitive understanding, we now proceed as in most econometric classes. A
crucial concept is that of a random variable, and it can be quite confusing. I quote from
Wikipedia to show you what this term encompasses:
In probability and statistics, a random variable [...] is a variable whose value
is subject to variations due to chance [...]. A random variable can take on a set
of possible different values (similarly to other mathematical variables), each with
an associated probability, in contrast to other mathematical variables.
A random variable’s possible values might represent the possible outcomes
of a yet-to-be-performed experiment, or the possible outcomes of a past exper-
iment whose already-existing value is uncertain (for example, due to imprecise
not always a good depiction of reality. Behavioral economics provides an alternative, and if you are interested in this I recommend you read up on prospect theory by Kahneman and Tversky. Also, Kahneman's book Thinking, Fast and Slow is an entertaining and informative read on this strand of economics. Nevertheless, it is unwise to criticize something you do not firmly understand, so make sure you pay attention to expected utility theory in your micro lecture. It does have its merits and applications in the real world!
3Remember, the power set is the set that contains all possible subsets of S.
measurements [...]). They may also conceptually represent either the results of
an ”objectively” random process (such as rolling a die) or the ”subjective” ran-
domness that results from incomplete knowledge of a quantity.
A more formal definition would again require some measure theory and then we would
introduce a random variable as a measurable function. Although you do not know the
meaning of the term measurable, it might be worthwhile to remember that a random variable
is a function (usually) mapping into the real numbers! It is not a set and it is also not the
probability measure. Instead, it is a function X that takes a possible outcome and assigns
to it a value on the real line describing some numerical property that this outcome may have.
The important property is that we can assign a probability to events defined through the
random variable, i.e., it makes sense to write something like P (X < 2), by which we mean
P ({ω : X(ω) < 2}). As an example, consider as outcome the gender of a newborn baby.
The set of possible outcomes is { boy, girl }. A random variable could be defined as follows:
X(ω) = 1 if ω = boy, and X(ω) = 0 if ω = girl.
Why would we need this? Because on the real line we can add, subtract and multiply at
will, whereas these operations might not be defined (and are lengthier in notation) on the
set of possible outcomes. As such, we can define another random variable Y which looks at
the genders of two newborn babies by defining Y (ω1, ω2) = X(ω1) + X(ω2); this gives the
number of boys in the pair. And then we can concisely write P (Y ≥ 1) instead of
P ({(boy, girl), (girl, boy), (boy, boy)}) or, even less concisely, ”the probability that at least
one child is a boy”.
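The two-newborn example can be made concrete in a few lines. This is a sketch of the construction in the text, with the extra assumption (mine, for illustration) that each of the four ordered pairs is equally likely.

```python
from itertools import product

def X(omega):
    """The random variable X from the text: 1 for a boy, 0 for a girl."""
    return 1 if omega == "boy" else 0

# All ordered pairs of outcomes for two newborns, assumed equally likely:
outcomes = list(product(["boy", "girl"], repeat=2))

def Y(pair):
    """Number of boys in the pair: Y(w1, w2) = X(w1) + X(w2)."""
    return X(pair[0]) + X(pair[1])

# P(Y >= 1) under the uniform distribution on the four pairs:
p = sum(1 for pair in outcomes if Y(pair) >= 1) / len(outcomes)
print(p)  # prints 0.75
```

Note that we never do arithmetic on the outcomes themselves, only on their images under X, which is exactly the point of mapping into the real line.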
This last sort of quantity arises so often as object of study that we give it its own name
and definition:
Definition 2.2. (Cumulative distribution function)
The CDF of a real-valued random variable X is a function F_X which assigns to x the
probability that X has a value less than or equal to x:
F_X : R → [0, 1], x ↦ P (X ≤ x).
Theorem 2.1. (Properties of the CDF and definition of the density)
For any CDF we have the following properties:
1. P (a < X ≤ b) = F_X(b) − F_X(a).
2. F_X is non-decreasing.
3. F_X is continuous from the right.
4. If X is a discrete random variable attaining the values x_i with probability p_i, then
F_X(x) = Σ_{i: x_i ≤ x} p_i.
5. If F_X is differentiable, then the derivative f_X is called the probability density function
and it holds that
• f(x) ≥ 0 for all x,
• ∫_{−∞}^{∞} f(x) dx = 1,
• P (a < X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f(x) dx.
The following figure illustrates the last property:
Figure 1: The probability that X lies between a and b (Source: Wooldridge)
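Property 5 can be checked numerically. The sketch below (my own, not from the text) uses the standard normal distribution, whose CDF can be written with the error function, and compares F(b) − F(a) against a crude midpoint-rule integral of the density.

```python
import math

def F(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

a, b, n = -1.0, 1.0, 100_000
dx = (b - a) / n
# Midpoint rule for the integral of f from a to b:
integral = sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx

print(round(F(b) - F(a), 4))  # prints 0.6827, the familiar "68% within one sigma"
assert abs(integral - (F(b) - F(a))) < 1e-6
```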
Distribution functions hold a lot of information. Some of this information can be aggre-
gated in single numbers which allow us to compare two different random variables. These
numbers are measures of location (e.g. mean, median) and dispersion (e.g. variance).
Definition 2.3. (Mean)
If the random variable X is discrete (i.e. has a discrete CDF) and takes on the values xiwith probability pi, then the mean or expected value is defined as the weighted average over
the outcome set with weights equal to the probabilities pi :
µ = E(X) =∑
xipi.
9
If X has a continuous distribution with density function f , then we define the mean or
expected value by
µ = E(X) =
∫ ∞−∞
xf(x)dx.
Remark 1. It follows from the definition in the discrete case that the expected value of
rolling a die is (1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = 3.5. Yet,
the value 3.5 will never be the result of one roll of the die. This example illustrates that the
expected value need not be the value that you expect with the greatest probability (in terms
of econometrics: it need not be the maximum likelihood estimator)!
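The die calculation from the remark, written out as the weighted sum from Definition 2.3. Exact fractions avoid any rounding.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                  # fair die: each face has probability 1/6

mu = sum(x * p for x in faces)      # E(X) = sum of x_i * p_i
print(mu)                           # prints 7/2, i.e. 3.5
assert mu not in faces              # 3.5 is never the result of one roll
```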
Remark 2. In the notation, E is the expectation operator. As explained in a previous chapter,
an operator is a function that takes as argument another function and maps it (in this case)
into R. This highlights again that a random variable is a function!
Theorem 2.2. (Properties of the mean)
Let X and Y be two random variables defined on the same outcome space and let a, b, c ∈ R.
Then
1. E(c) = c
2. If X ≤ Y , then E(X) ≤ E(Y ).
3. E(aX + bY ) = aE(X) + bE(Y ).
Proof. All results follow directly from the calculation rules for sums and integrals, see chapter
3 (Multivariate calculus).
A further central measure is the variance, which tells us how far a random variable is on
average from the mean.
Definition 2.4. (Variance and standard deviation)
The variance σ² of a CDF is defined as follows:
σ² = E[(X − µ)²],
where µ is the expected value of the CDF. It follows that for a discrete distribution function
we have
σ² = Σ_i (x_i − µ)² p_i,
and in the continuous case
σ² = ∫ (x − µ)² f(x) dx.
The standard deviation σ is the square root of the variance.
The graph below illustrates the definition.
The following is a very useful characterization of the variance:
Figure 2: Random variables with the same mean but different variance (Source: Wooldridge, figure B.4)
Theorem 2.3. (Characterization of variance)
For any random variable X with E(X²) < ∞, it holds that
σ²(X) = E(X²) − (E(X))².
Theorem 2.4. (Properties of the variance)
1. σ²(X) = 0 ⇔ ∃c ∈ R : P (X = c) = 1.
2. For any a, b ∈ R : σ²(aX + b) = a²σ²(X) and σ(aX + b) = |a|σ(X).
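Theorems 2.3 and 2.4 can be verified on a small discrete distribution. The values and probabilities below are a made-up toy example.

```python
xs = [0, 1, 4]          # values x_i (toy example)
ps = [0.5, 0.3, 0.2]    # probabilities p_i

E  = sum(x * p for x, p in zip(xs, ps))              # mean mu
E2 = sum(x * x * p for x, p in zip(xs, ps))          # E(X^2)
var = sum((x - E) ** 2 * p for x, p in zip(xs, ps))  # definition of variance

# Theorem 2.3: sigma^2 = E(X^2) - (E X)^2
assert abs(var - (E2 - E ** 2)) < 1e-9

# Theorem 2.4, property 2: Var(aX + b) = a^2 Var(X)
a, b = -3.0, 7.0
var_lin = sum((a * x + b - (a * E + b)) ** 2 * p for x, p in zip(xs, ps))
assert abs(var_lin - a ** 2 * var) < 1e-9

print(round(var, 4))  # prints 2.29
```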
2.3 Joint variables, distributions and their features
In econometrics as in other subjects applying statistical methods, one usually is confronted
with more than one random variable. For example, in health economics you might be
interested in the probability that a person visits a hospital conditional on that person being
insured.
Definition 2.5. (Joint distribution of two random variables)
Let X and Y be discrete random variables which take on the values x_i and y_j respectively.
Then the joint distribution is given by p_ij = P (X = x_i, Y = y_j) and the corresponding cdf
is described by
F_{X,Y}(x, y) = P (X ≤ x, Y ≤ y) = Σ_{(i,j): x_i ≤ x, y_j ≤ y} p_ij.
If X and Y are continuous variables, then their joint cdf is likewise defined as
F_{X,Y}(x, y) = P (X ≤ x, Y ≤ y),
and if the second mixed derivative of this function exists, then the corresponding joint density
is defined by
f_{X,Y}(x, y) = ∂²F(x, y)/(∂x ∂y).
Remark 1. As in the one-variable case, the joint density has the properties that f(x, y) ≥ 0
and ∫∫ f(x, y) dx dy = 1.⁴
Remark 2. Two important related concepts are that of the marginal distribution and the
conditional distribution. Here I aim only to explain the concepts: Imagine you have two
random variables, e.g. X measuring the sex of a person and Y the height of a person. The
marginal distribution of Y then describes the distribution of height in the whole population,
and the marginal distribution of X the distribution of sex in the whole population. In
contrast, the conditional distribution gives the distribution of one variable contingent on the
value of the other variable. For example, the distribution of Y conditional on X = 1 gives the
distribution of height amongst the male population. Marginal and conditional distributions
can coincide, and then the random variables are said to be independent.
Definition 2.6. (Independence)
Two discrete random variables X and Y are said to be independent if for all (x, y) it holds
that
P (X = x, Y = y) = P (X = x)P (Y = y).
For continuous distributions, this translates into
fX,Y (x, y) = fX(x)fY (y).
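The discrete independence criterion can be checked cell by cell from a joint probability table. The table below is a made-up example constructed so that the two binary variables are independent.

```python
# Joint probabilities p_ij = P(X = x_i, Y = y_j), a toy example:
joint = {
    (0, 0): 0.12, (0, 1): 0.28,   # row X = 0, marginal P(X=0) = 0.4
    (1, 0): 0.18, (1, 1): 0.42,   # row X = 1, marginal P(X=1) = 0.6
}

# Marginal distributions: sum the joint table over the other variable.
pX = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
pY = {y: sum(p for (_, yj), p in joint.items() if yj == y) for y in (0, 1)}

# Independence: every cell must factor as P(X = x) * P(Y = y).
independent = all(
    abs(joint[(x, y)] - pX[x] * pY[y]) < 1e-12 for x in (0, 1) for y in (0, 1)
)
print(independent)  # prints True
```

Changing any single cell (while renormalizing) would generally break the factorization and make the check return False.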
Let us conclude this section with one more important definition, which defines a measure
of how, on average, two variables vary with each other.
Definition 2.7. (Covariance and correlation)
The covariance of two random variables X and Y is defined as
Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)],
where µ_X, µ_Y are the means of the random variables. In the continuous case, this is equal
to
∫∫ (x − µ_X)(y − µ_Y) f_{X,Y}(x, y) dx dy.
The correlation coefficient between X and Y is then given by
ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y),
where σ_X, σ_Y are the standard deviations of the variables.
4Remember Fubini?
Remark 3. To get an intuition for the concept of covariance, it is useful to consider the
following: Let Cov(X, Y) > 0. Then, if X is above its mean, then, on average, Y is also above
its mean, and vice versa (think: height and weight of people). This relationship is also often
formulated as the covariance measuring the amount of linear relationship between the two
variables. A positive covariance shows that the two variables move in the same direction,
and a negative covariance shows that the two move in opposite directions.
Theorem 2.5. (Properties of covariance and correlation)
Let X, Y be two random variables with means µ_X, µ_Y.
1. Cov(X, Y) = E(XY) − µ_X µ_Y
2. X, Y independent ⇒ Cov(X, Y) = 0 (the converse is not true!)
3. Cov(a_1X + b_1, a_2Y + b_2) = a_1 a_2 Cov(X, Y) and ρ_{a_1X+b_1, a_2Y+b_2} = ρ_{X,Y} (for a_1 a_2 > 0)
4. −1 ≤ ρ_{X,Y} ≤ 1, and if |ρ_{X,Y}| = 1, then there exist a, b such that Y = a + bX.
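Property 2 warns that zero covariance does not imply independence. The classic counterexample (my own illustration, not from the text) is X uniform on {−1, 0, 1} and Y = X²: the variables are clearly dependent, yet uncorrelated.

```python
from fractions import Fraction

xs = [-1, 0, 1]
p = Fraction(1, 3)                       # X uniform on {-1, 0, 1}

EX  = sum(x * p for x in xs)             # E(X)  = 0
EY  = sum(x * x * p for x in xs)         # E(Y)  = E(X^2) = 2/3
EXY = sum(x * (x * x) * p for x in xs)   # E(XY) = E(X^3) = 0

cov = EXY - EX * EY                      # Cov(X, Y) = E(XY) - E(X)E(Y)
print(cov)                               # prints 0

# Yet X and Y are dependent: P(X = 0, Y = 0) = 1/3, while
# P(X = 0) * P(Y = 0) = 1/3 * 1/3 = 1/9.
assert Fraction(1, 3) != Fraction(1, 3) * Fraction(1, 3)
```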
Theorem 2.6. (Variance and covariance)
For two random variables X, Y and a, b ∈ R it holds that
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y).
Remark 4. Further very elementary concepts are those of the conditional mean and conditional
variance. Make sure you read up on these if you are unsure about the concepts.
3 Statistics
3.1 Introduction
Imagine you want to know what the average income of people in Germany is. It would be
quite cumbersome to ask every person individually. You are better off conducting a survey,
in which you only ask a number of people (your sample) about their income. You hope that
the information obtained from your sample is somehow representative for all of Germany.
For example, you might want to infer from the average income of your sample to the mean
income of all people in Germany.
What you are in fact doing is a parameter estimation, which can be modeled as follows:
There is a true distribution of income in Germany, called f_θ. You must assume some
functional form for this distribution, e.g. that it is a normal distribution, but you do not
know the parameter θ, which could for example be θ = (µ, σ²), the mean and variance of the
normal distribution. Instead, your aim is to estimate θ based on the observations y_1, ..., y_n
in your sample, which are realizations of the random variable ”income”.
A straightforward approach is the so-called method of moments. If you can express θ as a
function of population moments, then you would simply replace these by the sample moments
and obtain your estimate θ̂. In the above example, you could estimate the mean income in
Germany by the average income in your sample. There are other methods, such as least
squares and maximum likelihood estimation, which you will encounter in your econometrics
class. You could also define the income of person number 10 in your sample as your estimate.
Would this be a good choice? It is the aim of this section to find criteria which we expect a
good estimator to fulfill.
Now, imagine you conduct the above described survey on income multiple times, each
time drawing a new sample but asking the same question. Do you think you would also
get the same average income? Probably not. Your estimate is by itself a random variable
because its value depends on the sample you draw. Therefore, we can also look at the
sampling distribution of the sample mean, the sampling variance of the mean, and other properties.
The graph below nicely illustrates the concept.
Figure 3: Distribution of variable y from sample realizations and the sampling distribution of the mean (Source: Groves et al.: Survey methodology (2nd edition), figure 4.2)
Warning: It is easy to get confused with all the different variances. There are at least
three types: 1. the unknown variance of the population, σ², which we might like to estimate;
2. the estimate σ̂² which we obtain from the one sample we drew; 3. the unknown variance
of the distribution of our estimator, which describes how the estimate would change if we
were to draw the sample again and again. Clearly distinguishing between all of these will
make it a lot easier to understand the difference between the concepts in the next two
subsections.
In econometric terms, we consider a random sample as a series of random variables
Y_1, ..., Y_n which are independent and identically distributed (i.i.d.). As an example, think of
rolling a die n times. Typically, the actually measured outcomes are denoted by small letters:
y_1, ..., y_n. These are the realizations of the random variables. We are interested in some
parameter θ₀ of the unknown distribution of the Y_i and we estimate this by applying an
estimator:
θ̂ = h(Y_1, ..., Y_n).
Once we plug the observed values y_1, ..., y_n into the function h, we obtain the estimate θ̂.
As discussed in the chapter above, applying a function to random variables yields another
random variable with a new distribution. As the distribution of the Y_i is unknown, so is
the distribution of θ̂.
3.2 Finite sample properties
In the following, let us for simplicity assume that θ contains only one unknown parameter
of interest.
The most fundamental properties of an estimator are those of bias and variance. These
are nicely illustrated in the following dart-throwing graphs.
Figure 4: Illustration of bias and variance
Variance and bias together make up the mean squared error, which is a measure of how
far off our estimate is from the true parameter on average:
MSE(θ̂) = E[(θ̂ − θ₀)²] = Var(θ̂) + (E[θ̂] − θ₀)².
The first term on the right-hand side is the variance and the second is the squared bias. (To
derive this equality, note that θ₀ is not a random variable, but a constant.)
This gives us two criteria for evaluating the quality of an estimator.
Definition 3.1. (Unbiased and efficient estimators)
An estimator θ̂ of θ₀ is said to be unbiased if for all possible values of θ₀
E(θ̂) = θ₀.
Furthermore, the estimator is said to be efficient within a class of estimators if it minimizes
the variance within this class.
Remark 1. One might be tempted to directly discard any biased estimator. However, when
comparing an estimator with a small but nonzero bias and a small variance to an unbiased
estimator with a large variance, the former might still be the better choice. This is because
we only have one sample at hand from which to derive an estimate, and the outcome of the
biased estimator is then in general closer to θ₀ (i.e. it has a smaller mean squared error).
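The MSE criterion can be illustrated by simulation with the two estimators mentioned earlier: the sample average versus "the income of person number 10". Both are unbiased here, but their variances, and hence their MSEs, differ sharply. The income distribution below is a made-up example.

```python
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 3000.0, 1000.0, 100, 2000  # toy income distribution

means, singles = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))   # estimator 1: sample average
    singles.append(sample[9])                # estimator 2: observation no. 10

def mse(estimates, true_value):
    """Empirical mean squared error around the true parameter."""
    return statistics.fmean((e - true_value) ** 2 for e in estimates)

# Var(Ybar) = sigma^2 / n = 10_000, while Var(Y_10) = sigma^2 = 1_000_000.
print(mse(means, mu) < mse(singles, mu))  # prints True
```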
Example 3.1. We show that the sample average Ȳ is indeed an unbiased estimator of the
mean µ of a random variable, even if we do not know the distribution of the Y_i:
E(Ȳ) = E((1/n) Σ_{i=1}^n Y_i) = (1/n) Σ_{i=1}^n E(Y_i) = (1/n) Σ_{i=1}^n µ = (1/n) · nµ = µ.
Example 3.2. Let us calculate the variance of the estimator Ȳ for µ, using the independence
of the Y_i:
Var(Ȳ) = Var((1/n) Σ_{i=1}^n Y_i) = (1/n²) Σ_{i=1}^n Var(Y_i) = (1/n²) · nσ² = σ²/n.
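The result Var(Ȳ) = σ²/n can be checked by simulation: draw many samples, compute each sample's average, and compare the variance across those averages with the theoretical value. The die-roll setup is my own choice; a fair die has population variance 35/12.

```python
import random
import statistics

random.seed(1)
n, reps = 50, 4000
sigma2 = 35.0 / 12.0   # population variance of one fair die roll

# One sample average per repetition: the sampling distribution of Ybar.
means = [
    statistics.fmean(random.randint(1, 6) for _ in range(n))
    for _ in range(reps)
]
sampling_var = statistics.pvariance(means)  # variance across the 4000 averages

print(round(sigma2 / n, 4))  # theoretical value sigma^2 / n, prints 0.0583
assert abs(sampling_var - sigma2 / n) < 0.01
```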
3.3 Large sample properties
The last example showed that the variance of an estimator can depend on the sample size n.
This already hints towards the importance of considering properties of an estimator if the
sample size tends towards infinity. To this aim, we define another important quality criterion,
which broadly speaking ensures that if we were to have infinitely many observations at hand,
we would indeed obtain the true value.
Definition 3.2. (Consistency)
Let θ̂_n be an estimator of θ₀ which makes use of the random variables Y_1, ..., Y_n. Then θ̂_n
is said to be a consistent estimator for θ₀ if for every ε > 0
P (|θ̂_n − θ₀| > ε) → 0 as n → ∞.
To introduce a concise notation: the estimator is consistent iff
plim_{n→∞}(θ̂_n) = θ₀.
Remark 1. Loosely speaking, a consistent estimator can be thought of as an estimator whose
bias and variance both shrink to zero as the sample size increases.
The sample average considered above is a consistent estimator of the mean. This result
is so important that it is in fact a theorem with a well-known name:
Figure 5: The sampling distributions of a consistent estimator for three sample sizes (Source: Wooldridge, Figure C.3)
Theorem 3.1. (Law of large numbers)
Let Y_1, Y_2, ..., Y_n be i.i.d. random variables with mean µ. Then,
plim_{n→∞}(Ȳ_n) = µ.
Proof. (Idea only)
We have shown that Var(Ȳ_n) = σ²/n, hence Var(Ȳ_n) → 0 for n → ∞. A more detailed
proof requires Chebyshev's inequality; you can look it up on Wikipedia. Note that the
theorem does not require the Y_i to have finite variance, so another proof is necessary in the
case of infinite variance. This more involved form can also be found on Wikipedia.
Remark 2. The above is the so-called weak law of large numbers. There is also the strong
law, which asserts almost sure convergence. The difference is the following: the weak law
still allows the event |Ȳ_n − µ| > ε to occur infinitely many times (at ever more irregular
intervals), whereas the strong law shows that, with probability one, for every ε there is an
n₀ such that |Ȳ_n − µ| < ε for all n > n₀.
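A quick simulation gives a feeling for the law of large numbers: the running average of die rolls approaches µ = 3.5 as n grows. This is a sketch of the phenomenon, of course, not a proof.

```python
import random

random.seed(42)
total, checkpoints = 0.0, {}
for n in range(1, 100_001):
    total += random.randint(1, 6)        # one more die roll
    if n in (10, 1000, 100_000):
        checkpoints[n] = total / n       # the running mean Ybar_n

for n, ybar in checkpoints.items():
    print(n, round(ybar, 3))

# In a typical run the deviation from mu = 3.5 shrinks as n grows:
assert abs(checkpoints[100_000] - 3.5) < 0.05
```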
The following property of the plim allows the application of the law of large numbers to
a wide variety of problems:
Theorem 3.2. (Continuous functions and the plim)
Let g be a continuous function and θ₀ a parameter which we estimate with the consistent
estimator θ̂_n, i.e. plim(θ̂_n) = θ₀. Then we can consistently estimate the parameter g(θ₀)
with the estimator G_n := g(θ̂_n), i.e. plim(G_n) = plim(g(θ̂_n)) = g(plim(θ̂_n)) = g(θ₀).
Consistency is an important property which tells us that the distribution of the estimator
gets more and more concentrated around the parameter as we increase sample size. Indeed,
it is possible to know even more about the distribution of the estimator and this is important
for interval estimation (such as confidence intervals) and hypothesis testing.
Definition 3.3. (Asymptotic normal distribution)
If {Y_1, Y_2, ...} is an infinite sequence of random variables such that for all z ∈ R we have
P (Y_n ≤ z) → Φ(z) for n → ∞,
where Φ(z) is the standard normal distribution function, then Y_n is said to be asymptotically
normally distributed, written Y_n ∼a N(0, 1).
Let us look one final time at our estimator of the sample average for the mean. The
last result of our lecture states that a standardized version of this estimator is indeed
asymptotically normally distributed. This is a fundamental result and deserves its own
name:
Theorem 3.3. (Central limit theorem)
Let {Y_1, ..., Y_n} be a random sample with mean µ and variance σ². Then,
Z_n := (Ȳ_n − µ) / (σ/√n) ∼a N(0, 1).
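The theorem can be seen at work in a simulation (my own sketch): standardized sample means of die rolls should behave like a standard normal, so roughly 95% of them should fall within ±1.96.

```python
import math
import random
import statistics

random.seed(7)
mu, sigma = 3.5, math.sqrt(35.0 / 12.0)  # mean and sd of one die roll
n, reps = 100, 2000

zs = []
for _ in range(reps):
    ybar = statistics.fmean(random.randint(1, 6) for _ in range(n))
    zs.append((ybar - mu) / (sigma / math.sqrt(n)))  # Z_n from Theorem 3.3

# Fraction of standardized means inside the standard normal 95% interval:
share = sum(1 for z in zs if abs(z) <= 1.96) / reps
print(round(share, 2))  # close to 0.95, as for a standard normal
```

Note that the individual die rolls are far from normal; it is only the (standardized) average that becomes normal as n grows.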
There is a pretty good app which simulates random draws and the sampling distributions
of different estimators. I recommend you play around with it for a few minutes to get a
feeling for the central limit theorem: http://onlinestatbook.com/stat_sim/sampling_dist/index.html