9주차 (Week 9)

Introduction to Probability and Statistics, 9th Week (5/10): 1. Descriptive Statistics 2. Sampling Theory


Page 1: 9주차

Introduction to Probability and Statistics, 9th Week (5/10)

1. Descriptive Statistics 2. Sampling Theory

Page 2: 9주차

Probability is a science of ( ).

Statistics is a science of ( ).

Page 3: 9주차

Probability is the Science of Uncertainty.

• It is used by physicists to predict the behaviour of elementary particles.
• It is used by engineers to build computers.
• It is used by economists to predict the behaviour of the economy.
• It is used by stockbrokers to make money on the stock market.
• It is used by psychologists to determine if you should get that job.

Page 4: 9주차

What about Statistics?

Statistics is the Science of Data.

There are two kinds of statistics:

Descriptive Statistics: the discipline of quantitatively describing the main features of a collection of data.

Inferential Statistics: the discipline of estimating unknown quantities from a few elementary measurements. Using these estimates, we can then make predictions and forecast the future.

Page 5: 9주차

Descriptive Statistics

• Describing data with tables and graphs (quantitative or categorical variables)

• Numerical descriptions of center, variability, position (quantitative variables)

• Bivariate descriptions (In practice, most studies have several variables)

Page 6: 9주차

1. Tables and Graphs

Frequency distribution: Lists the possible values of a variable and the number of times each occurs

Example: Student survey (n = 60)

“political ideology” measured as an ordinal variable with 1 = very liberal, …, 4 = moderate, …, 7 = very conservative
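A frequency distribution can be tabulated in a few lines of Python. The sketch below is only an illustration: the `responses` list is hypothetical (the survey's actual 60 responses are not reproduced in these slides), and `frequency_table` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical responses on the 7-point ideology scale (NOT the survey data).
responses = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 4, 3]

def frequency_table(values):
    """Return {value: (count, percent)} with values in increasing order."""
    counts = Counter(values)
    n = len(values)
    return {v: (counts[v], 100 * counts[v] / n) for v in sorted(counts)}

table = frequency_table(responses)
for value, (count, pct) in table.items():
    print(f"{value}: count={count:2d}  percent={pct:5.1f}")
```

Reporting percentages next to counts makes groups of different sizes easier to compare.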

Page 7: 9주차
Page 8: 9주차

Histogram: Bar graph of frequencies or percentages

Page 9: 9주차

Shapes of histograms (for quantitative variables)

• Bell-shaped (IQ, SAT, political ideology in all U.S.)
• Skewed right (annual income, no. times arrested)
• Skewed left (score on easy exam)
• Bimodal (polarized opinions)

Page 10: 9주차

Stem-and-leaf plot (John Tukey, 1977)

Example: Exam scores (n = 40 students)

Stem  Leaf
3     6
4
5     37
6     235899
7     011346778999
8     00111233568889
9     02238
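The plot can be rebuilt from raw scores by splitting each score into a tens digit (stem) and a units digit (leaf). This is a minimal sketch: the `scores` list is read back from the leaves shown in the plot, on the assumption that the flattened plot was parsed correctly, and `stem_and_leaf` is a helper name introduced here.

```python
from collections import defaultdict

# The 40 exam scores as read back from the stem-and-leaf plot (assumed parse).
scores = [36, 53, 57, 62, 63, 65, 68, 69, 69,
          70, 71, 71, 73, 74, 76, 77, 77, 78, 79, 79, 79,
          80, 80, 81, 81, 81, 82, 83, 83, 85, 86, 88, 88, 88, 89,
          90, 92, 92, 93, 98]

def stem_and_leaf(values):
    """Return one text line per stem (tens digit), leaves (units) in order."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}".rstrip())
    return lines

lines = stem_and_leaf(scores)
print("\n".join(lines))
```

Unlike a histogram, the plot keeps every individual observation readable.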

Page 11: 9주차

2. Numerical descriptions

Let y denote a quantitative variable, with observations $y_1, y_2, y_3, \ldots, y_n$

a. Describing the center

Median: Middle measurement of ordered sample

Mean: $\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n} = \dfrac{\sum y_i}{n}$

Page 12: 9주차

Properties of mean and median

• For symmetric distributions, mean = median
• For skewed distributions, mean is drawn in direction of longer tail, relative to median
• Mean valid for interval scales, median for interval or ordinal scales
• Mean sensitive to “outliers” (median often preferred for highly skewed distributions)
• When distribution symmetric or mildly skewed or discrete with few values, mean preferred because it uses the numerical values of the observations

Page 13: 9주차

Examples:

• New York Yankees baseball team, 2006:
  mean salary = $7.0 million
  median salary = $2.9 million

How is this possible? What is the direction of skew?

• Give an example for which you would expect mean < median
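One data set with mean < median is the set of exam scores from the stem-and-leaf example earlier in the deck, where a few low scores form a left tail. The sketch below assumes the scores were read correctly from that plot; the `plot` dictionary is a restatement introduced here.

```python
import statistics

# Stem -> leaves, as read from the exam-score stem-and-leaf plot (assumed parse).
plot = {3: "6", 4: "", 5: "37", 6: "235899",
        7: "011346778999", 8: "00111233568889", 9: "02238"}
scores = [10 * stem + int(leaf) for stem, leaves in plot.items() for leaf in leaves]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# The low scores (36, 53, 57) pull the mean below the median: skewed left.
print(f"n = {len(scores)}, mean = {mean_score:.2f}, median = {median_score}")
```

The same reasoning in the opposite direction explains the salary example: a few very large values pull the mean above the median.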

Page 14: 9주차

b. Describing variability

Range: Difference between largest and smallest observations (but highly sensitive to outliers, insensitive to shape)

Standard deviation: A “typical” distance from the mean

The deviation of observation i from the mean is $y_i - \bar{y}$

Page 15: 9주차

The variance of the n observations is

$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$$

The standard deviation s is the square root of the variance,

$$s = \sqrt{s^2}$$

Page 16: 9주차

Example: Political ideology

• For those in the student sample who attend religious services at least once a week (n = 9 of the 60),
• y = 2, 3, 7, 5, 6, 7, 5, 6, 4

$$\bar{y} = 5.0, \qquad s^2 = \frac{(2 - 5)^2 + (3 - 5)^2 + \cdots + (4 - 5)^2}{9 - 1} = \frac{24}{8} = 3.0, \qquad s = \sqrt{3.0} = 1.7$$

For the entire sample (n = 60), mean = 3.0 and standard deviation = 1.6: the full sample tends to have similar variability but to be more liberal.
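The arithmetic for the n = 9 subsample can be checked directly. This sketch applies the n − 1 formula by hand and compares it with Python's `statistics` module; the variable names are introduced here for illustration.

```python
import math
import statistics

y = [2, 3, 7, 5, 6, 7, 5, 6, 4]  # the nine observations from the slide
n = len(y)

y_bar = sum(y) / n                                # mean
sum_sq_dev = sum((yi - y_bar) ** 2 for yi in y)   # sum of squared deviations
s2 = sum_sq_dev / (n - 1)                         # sample variance (n - 1 divisor)
s = math.sqrt(s2)                                 # standard deviation

print(f"mean = {y_bar:.1f}, variance = {s2:.1f}, sd = {s:.1f}")

# statistics.variance and statistics.stdev use the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(y))
assert math.isclose(s, statistics.stdev(y))
```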

Page 17: 9주차

c. Measures of position

pth percentile: p percent of observations below it, (100 - p)% above it.

p = 50: median
p = 25: lower quartile (LQ)
p = 75: upper quartile (UQ)

Interquartile range IQR = UQ - LQ

Page 18: 9주차

Quartiles portrayed graphically by box plots (John Tukey)

Example: weekly TV watching for n=60 from student survey data file, 3 outliers

Page 19: 9주차

Box plots have box from LQ to UQ, with median marked. They portray a five-number summary of the data:

Minimum, LQ, Median, UQ, Maximum

except for outliers identified separately

Outlier = observation falling below LQ – 1.5(IQR) or above UQ + 1.5(IQR)

Ex. If LQ = 2, UQ = 10, then IQR = 8, and outliers are observations above 10 + 1.5(8) = 22 (or below 2 – 1.5(8) = –10)
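The 1.5(IQR) rule is easy to express as code. The function names `outlier_fences` and `is_outlier` below are introduced for illustration only.

```python
def outlier_fences(lq, uq):
    """Return (lower, upper) cut-offs from the 1.5 * IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def is_outlier(value, lq, uq):
    """True if value falls below the lower fence or above the upper fence."""
    lower, upper = outlier_fences(lq, uq)
    return value < lower or value > upper

lower, upper = outlier_fences(2, 10)   # the slide's example: LQ = 2, UQ = 10
print(lower, upper)
print(is_outlier(25, 2, 10), is_outlier(15, 2, 10))
```

An observation exactly on a fence is not flagged, since the rule speaks of values falling below or above the cut-offs.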

Page 20: 9주차

3. Bivariate description

• Usually we want to study associations between two or more variables (e.g., how does number of close friends depend on gender, income, education, age, working status, rural/urban, religiosity…)

• Response variable: the outcome variable
• Explanatory variable(s): defines groups to compare

Ex.: number of close friends is a response variable, while gender, income, … are explanatory variables

Response var. also called “dependent variable”
Explanatory var. also called “independent variable”

Page 21: 9주차

Summarizing associations:

• Categorical var’s: show data using contingency tables
• Quantitative var’s: show data using scatterplots
• Mixture of categorical var. and quantitative var. (e.g., number of close friends and gender) can give numerical summaries (mean, standard deviation) or side-by-side box plots for the groups

• Ex. General Social Survey (GSS) data

Men: mean = 7.0, s = 8.4

Women: mean = 5.9, s = 6.0

Shape? Inference questions for later chapters?

Page 22: 9주차

Example: Income by highest degree

Page 23: 9주차

Contingency Tables

• Cross classifications of categorical variables in which rows (typically) represent categories of explanatory variable and columns represent categories of response variable.

• Counts in “cells” of the table give the numbers of individuals at the corresponding combination of levels of the two variables

Page 24: 9주차

Happiness and Family Income (GSS 2008 data: “happy,” “finrela”)

                     Happiness
Income         Very   Pretty   Not too   Total
----------------------------------------------
Above Aver.     164      233        26     423
Average         293      473       117     883
Below Aver.     132      383       172     687
----------------------------------------------
Total           589     1089       315    1993

Page 25: 9주차

Can summarize by percentages on response variable (happiness)

Example: Percentage “very happy” is

39% for above aver. income (164/423 = 0.39)

33% for average income (293/883 = 0.33)

19% for below average income (??)

Page 26: 9주차

                        Happiness
Income     Very         Pretty       Not too      Total
-------------------------------------------------------
Above      164 (39%)    233 (55%)     26  (6%)      423
Average    293 (33%)    473 (54%)    117 (13%)      883
Below      132 (19%)    383 (56%)    172 (25%)      687
-------------------------------------------------------

Inference questions for later chapters? (i.e., what can we conclude about the corresponding population?)
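The row percentages in the table come from dividing each cell count by its row total. A minimal sketch, using the GSS counts from the slide; the dictionary layout and the name `row_percentages` are choices made here.

```python
# Counts from the slide: (very happy, pretty happy, not too happy) by income.
counts = {
    "Above average": (164, 233, 26),
    "Average": (293, 473, 117),
    "Below average": (132, 383, 172),
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return {
        row: tuple(round(100 * cell / sum(cells)) for cell in cells)
        for row, cells in table.items()
    }

percentages = row_percentages(counts)
for row, pcts in percentages.items():
    print(f"{row:14s} {pcts}")
```

Conditioning on the explanatory variable (income) in this way makes the response distributions comparable even though the income groups differ in size.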

Page 27: 9주차

Scatterplots (for quantitative variables) plot response variable on vertical axis, explanatory variable on horizontal axis

Example: Table 9.13 (p. 294) shows UN data for several nations on many variables, including fertility (births per woman), contraceptive use, literacy, female economic activity, per capita gross domestic product (GDP), cell-phone use, CO2 emissions

Data available at http://www.stat.ufl.edu/~aa/social/data.html

Page 28: 9주차
Page 29: 9주차

Example: Survey in Alachua County, Florida, on predictors of mental health

(data for n = 40 on p. 327 of text and at www.stat.ufl.edu/~aa/social/data.html)

y = measure of mental impairment (incorporates various dimensions of psychiatric symptoms, including aspects of depression and anxiety)

(min = 17, max = 41, mean = 27, s = 5)

x = life events score (events range from severe personal disruptions such as death in family, extramarital affair, to less severe events such as new job, birth of child, moving)

(min = 3, max = 97, mean = 44, s = 23)

Page 30: 9주차
Page 31: 9주차

Bivariate data from 2000 Presidential election

Butterfly ballot, Palm Beach County, FL, text p.290

Page 32: 9주차

Example: The Massachusetts Lottery (data for 37 communities)

(Scatterplot of per capita income and % income spent on lottery)

Page 33: 9주차

Correlation describes strength of association

• Falls between -1 and +1, with sign indicating direction of association (formula later in Chapter 9)

The larger the correlation in absolute value, the stronger the association (in terms of a straight line trend)

Examples: (positive or negative, how strong?)

Mental impairment and life events, correlation =

GDP and fertility, correlation =

GDP and percent using Internet, correlation =
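The slides defer the formula to a later chapter. As a rough illustration, the sketch below assumes the usual Pearson product-moment correlation; the data in `x` and `y` are hypothetical and are not taken from any of the examples above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of cross-deviations over the root of the
    product of the two sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

A value near +1 or −1 indicates points close to a straight line, while a value near 0 indicates little linear trend.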

Page 34: 9주차
Page 35: 9주차

Inferential Statistics: Fortune Teller

How can she read the future?

• Analysis of Data from Her Previous Victims (Clients)
• Make Hypotheses
• Test Them
• Fool You!

Page 36: 9주차

Population and Sample

- Often in practice we are interested in drawing valid conclusions about a large group of individuals or objects.

- Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample.

- The process of obtaining samples is called sampling.

(Diagram: a sample is drawn from the population by sampling)

Page 37: 9주차

Statistical Inference

- We draw a sample with the aim of inferring certain facts about the population from results found in the sample, a process known as statistical inference.

Page 38: 9주차

Sampling With and Without Replacement

- Population may be finite or infinite.

- If the population is finite, the sampling method is important.

- If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again.

- Sampling with replacement: Sampling where each member of a population may be chosen more than once

- Sampling without replacement: sampling where each member cannot be chosen more than once

- A finite population that is sampled with replacement can theoretically be considered infinite since samples of any size can be drawn without exhausting the population.
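The two schemes map onto two functions in Python's standard `random` module: `choices` draws with replacement and `sample` draws without. The urn contents and the seed below are arbitrary choices made for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
urn = ["red", "green", "blue", "white", "black"]  # a hypothetical finite population

# With replacement: an object may be drawn more than once, and the sample
# may be larger than the population.
with_replacement = rng.choices(urn, k=8)

# Without replacement: each object appears at most once, so k cannot
# exceed the population size.
without_replacement = rng.sample(urn, k=3)

print(with_replacement)
print(without_replacement)
```

Asking `rng.sample` for more than five objects raises `ValueError`, which is the code-level counterpart of exhausting the population.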

Page 39: 9주차

Random Samples, Random Numbers

- Clearly, the reliability of conclusions drawn concerning a population depends on whether the sample is properly chosen so as to represent the population sufficiently well, and one of the important problems of statistical inference is just how to choose a sample.

- One way to do this for finite populations is to make sure that each member of the population has the same chance of being in the sample, which is then often called a random sample.

- Random sampling can be accomplished for relatively small populations by drawing lots or, equivalently, by using a table of random numbers specially constructed for such purposes.

- Because inference from sample to population cannot be certain, we must use the language of probability in any statement of conclusions.

Page 40: 9주차

Random Samples, Random Numbers

Page 41: 9주차

Population Parameters

- A population is considered to be known when we know the probability distribution f (x) (probability function or density function) of the associated random variable X.

- If X is a random variable whose values are the heights (or weights) of the 12,000 students, then X has a probability distribution f (x).

- If, for example, X is normally distributed, we say that the population is normally distributed or that we have a normal population. Similarly, if X is binomially distributed, we say that the population is binomially distributed or that we have a binomial population.

- There will be certain quantities that appear in f(x), such as μ and σ in the case of the normal distribution or p in the case of the binomial distribution.

- Other quantities such as the median, moments, and skewness can then be determined in terms of these.

Page 42: 9주차

Population Parameters

- All such quantities are often called population parameters.

- When we are given the population so that we know f(x), then the population parameters are also known.

- An important problem arises when the probability distribution f(x) of the population is not known precisely, although we may have some idea of, or at least be able to make some hypothesis concerning, the general behavior of f(x).

- For example, we may have some reason to suppose that a particular population is normally distributed.

- In that case we may not know one or both of the values μ and σ, and so we might wish to draw statistical inferences about them.

Page 43: 9주차

Population Parameters

Page 44: 9주차

Sample Statistics

- We can take random samples from the population and then use these samples to obtain values that serve to estimate and test hypotheses about the population parameters.

Page 45: 9주차

Sample Statistics

Page 46: 9주차

Sample statistics / Population parameters

• We distinguish between summaries of samples (statistics) and summaries of populations (parameters).

• Common to denote statistics by Roman letters, parameters by Greek letters:

Page 47: 9주차

Sample Statistics

- In general, corresponding to each population parameter there will be a statistic to be computed from the sample.

- Usually the method for obtaining this statistic from the sample is similar to that for obtaining the parameter from a finite population, since a sample consists of a finite set of values.

- As we shall see, however, this may not always produce the “best estimate,” and one of the important problems of sampling theory is to decide how to form the proper sample statistic that will best estimate a given population parameter.

- Where possible we shall try to use Greek letters, such as μ and σ, for values of population parameters, and Roman letters, m, s, etc., for values of corresponding sample statistics.

Page 48: 9주차

Sampling Distribution

- As we have seen, a sample statistic that is computed from X1, . . . , Xn is a function of these random variables and is therefore itself a random variable.

- The probability distribution of a sample statistic is often called the sampling distribution of the statistic.

- Alternatively we can consider all possible samples of size n that can be drawn from the population, and for each sample we compute the statistic.

- In this manner we obtain the distribution of the statistic, which is its sampling distribution.

- For a sampling distribution, we can of course compute a mean, variance, standard deviation, moments, etc.

- The standard deviation is sometimes also called the standard error.
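The "all possible samples" view can be carried out exactly for a tiny population. The sketch below assumes a hypothetical population of five values and samples of size 2 drawn with replacement, then compares the spread of the sample means with the standard result that the variance of the sample mean is the population variance divided by n.

```python
import math
from itertools import product
from statistics import fmean, pvariance

population = [2, 3, 6, 8, 11]  # hypothetical population, for illustration only
n = 2                          # sample size

# Every ordered sample of size n drawn with replacement (5**2 = 25 samples).
sample_means = [fmean(sample) for sample in product(population, repeat=n)]

mu = fmean(population)              # population mean
sigma2 = pvariance(population)      # population variance (divisor N)

mean_of_means = fmean(sample_means)     # mean of the sampling distribution
var_of_means = pvariance(sample_means)  # variance of the sampling distribution
standard_error = math.sqrt(var_of_means)

print(mu, sigma2, mean_of_means, var_of_means, round(standard_error, 3))
```

With sampling without replacement from a finite population, the variance of the sample mean would be smaller, so this check applies only to the with-replacement case assumed here.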

Page 49: 9주차

The Sample Mean

Page 50: 9주차

Sampling Distribution of Means

Page 51: 9주차

Sampling Distribution of Means

Page 52: 9주차

Sampling Distribution of Means

Page 53: 9주차

Sampling Distribution of Proportions

- Suppose that a population is infinite and binomially distributed, with p and q = 1- p being the respective probabilities that any given member exhibits or does not exhibit a certain property.

- Consider all possible samples of size n drawn from this population, and for each sample determine the statistic that is the proportion P of successes.

- In the case of the coin, P would be the proportion of heads turning up in n tosses. Then we obtain a sampling distribution of proportions whose mean $\mu_P$ and standard deviation $\sigma_P$ are given by

$$\mu_P = p, \qquad \sigma_P = \sqrt{\frac{pq}{n}} = \sqrt{\frac{p(1 - p)}{n}} \tag{9}$$

- For finite populations in which sampling is without replacement, the second equation in (9) is replaced by $\sigma_{\bar{X}}$ as given by (6) with $\sigma = \sqrt{pq}$.

Page 54: 9주차

Sampling Distribution of Proportions

Page 55: 9주차

Example 6. Find the probability that in 120 tosses of a fair coin (a) between 40% and 60% will be heads, (b) or more will be heads.

Page 56: 9주차

Example 6. (Continued)
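Part (a) of Example 6 can be checked numerically. The worked solution is not preserved in this transcript, so the sketch below is one possible approach rather than the slide's own: it assumes "between 40% and 60%" includes both endpoints (48 to 72 heads), computes the exact binomial probability, and compares it with a normal approximation that uses the standard deviation of a proportion, sqrt(pq/n), plus a continuity correction of 1/(2n).

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 120, 0.5
k_low, k_high = 48, 72   # 40% and 60% of 120 tosses, endpoints included (assumption)

# Exact binomial probability for a fair coin: favourable outcomes over 2**n.
exact = sum(comb(n, k) for k in range(k_low, k_high + 1)) / 2**n

# Normal approximation for the sample proportion.
sigma_p = sqrt(p * (1 - p) / n)
half_step = 1 / (2 * n)  # continuity correction on the proportion scale
std_normal = NormalDist()
approx = (std_normal.cdf((0.60 + half_step - p) / sigma_p)
          - std_normal.cdf((0.40 - half_step - p) / sigma_p))

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```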

Page 57: 9주차

Sampling Distribution of Differences and Sums

Page 58: 9주차

Sampling Distribution of Differences and Sums

Page 59: 9주차

Example 7. The electric light bulbs of manufacturer A have a mean lifetime of 1400 hours with a standard deviation of 200 hours, while those of manufacturer B have a mean lifetime of 1200 hours with a standard deviation of 100 hours. If random samples of 125 bulbs of each brand are tested, what is the probability that the brand A bulbs will have a mean lifetime that is at least (a) 160 hours, (b) 250 hours more than the brand B bulbs?

Page 60: 9주차

Example 7. (Answer)
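The answer slide's content is not preserved in this transcript. A hedged sketch of one way to reach an answer follows: it assumes the two samples are independent and that the difference of sample means is approximately normal, with mean equal to the difference of the population means and variance equal to the sum of the two variances of the sample means.

```python
from math import sqrt
from statistics import NormalDist

mu_a, sigma_a, n_a = 1400, 200, 125   # brand A: mean, sd (hours), sample size
mu_b, sigma_b, n_b = 1200, 100, 125   # brand B

# Difference of sample means (A minus B), assuming independent samples.
mu_diff = mu_a - mu_b
sigma_diff = sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)

diff = NormalDist(mu_diff, sigma_diff)
p_a = 1 - diff.cdf(160)   # (a) A's mean exceeds B's by at least 160 hours
p_b = 1 - diff.cdf(250)   # (b) A's mean exceeds B's by at least 250 hours

print(f"sigma_diff = {sigma_diff:.1f}, (a) {p_a:.4f}, (b) {p_b:.4f}")
```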

Page 61: 9주차

Example 8. Ball bearings of a given brand have a mean weight of 0.50 oz with a standard deviation of 0.02 oz. What is the probability that two lots, of 1000 ball bearings each, will differ in total weight by more than 2 oz?
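One way to approach Example 8, sketched under stated assumptions: the bearings weigh 0.50 oz on average, the weights are independent, the two lots are independent, and the total weight of a lot of 1000 is approximately normal. The question is read as asking about the absolute difference of the two total weights.

```python
from math import sqrt
from statistics import NormalDist

sigma = 0.02      # sd of a single bearing's weight (oz)
lot_size = 1000

# Total weight of one lot has sd sigma * sqrt(lot_size); the difference of two
# independent lot totals has variance equal to the sum of the two variances.
sd_lot_total = sigma * sqrt(lot_size)
sd_difference = sqrt(2) * sd_lot_total

# The difference has mean 0 (same brand), so use a two-sided tail probability.
difference = NormalDist(0, sd_difference)
p_more_than_2oz = 2 * (1 - difference.cdf(2))

print(f"sd of difference = {sd_difference:.4f} oz, P = {p_more_than_2oz:.4f}")
```

Working with mean weights instead of totals gives the same probability, since both the threshold and the standard deviation are divided by 1000.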

Page 62: 9주차

The Sample Variance

Page 63: 9주차

The Sample Variance

Page 64: 9주차

Sampling Distribution of Variances

Page 65: 9주차

(Continuous) Chi Square Distribution

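A special gamma distribution with α = r/2, β = 2

PDF: $f(x) = \dfrac{1}{\Gamma(r/2)\, 2^{r/2}}\, x^{r/2 - 1} e^{-x/2}, \quad x > 0$

E(X) = αβ = r

Var(X) = αβ² = 2r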

Page 66: 9주차

(Continuous) Chi Square Distribution

Page 67: 9주차

(Continuous) Chi Square Distribution

Page 68: 9주차

Case Where Population Variance Is Unknown

Page 69: 9주차

(Continuous) Student t Distribution

Page 70: 9주차

(Continuous) Student t Distribution

Page 71: 9주차

(Continuous) F Distribution

Page 72: 9주차

(Continuous) F Distribution

Page 73: 9주차

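(Continuous) F Distribution

⊙ Effect of the degrees of freedom

For a given α, the 100(1 − α)th percentile is denoted $f_\alpha(m, n)$:

(1) $P(X \ge f_\alpha(m, n)) = \alpha$

(2) $P(f_{1-\alpha/2}(m, n) \le X \le f_{\alpha/2}(m, n)) = 1 - \alpha$

(3) $F \sim F(m, n) \Rightarrow 1/F \sim F(n, m)$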
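Properties (1) and (3) together give a relation between lower-tail and upper-tail critical values, which is useful because tables usually list only small α. The short derivation below is a sketch that uses only the notation of (1) and (3) and the fact that F is positive and continuous.

```latex
\begin{align*}
F \sim F(m, n):\quad
  & P\bigl(F \ge f_{1-\alpha}(m, n)\bigr) = 1 - \alpha
    && \text{by (1) with } 1 - \alpha \text{ in place of } \alpha \\
  & P\!\left(\frac{1}{F} \le \frac{1}{f_{1-\alpha}(m, n)}\right) = 1 - \alpha
    && \text{since } F > 0 \\
  & P\!\left(\frac{1}{F} \ge \frac{1}{f_{1-\alpha}(m, n)}\right) = \alpha
    && \text{complement, } 1/F \text{ continuous} \\
  & f_{\alpha}(n, m) = \frac{1}{f_{1-\alpha}(m, n)}
    && \text{by (1) and (3), since } 1/F \sim F(n, m)
\end{align*}
```

Note that the degrees of freedom swap places when the reciprocal is taken.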

Page 74: 9주차

(Continuous) F Distribution

Page 75: 9주차

Sampling Distribution of Ratios of Variances

Page 76: 9주차

Other Statistics

Page 77: 9주차

Other Statistics