Chapter 1 Review. Analyzing Categorical Data Categorical Variables place individuals into one of...

Chapter 1 Review


Analyzing Categorical Data

• Categorical variables place individuals into one of several groups or categories.

– The values of a categorical variable are labels for the different categories.

– The distribution of a categorical variable lists the count or percent of individuals who fall into each category.

Frequency Table

Format                Count of Stations
Adult Contemporary                 1556
Adult Standards                    1196
Contemporary Hit                    569
Country                            2066
News/Talk                          2179
Oldies                             1060
Religious                          2014
Rock                                869
Spanish Language                    750
Other Formats                      1579
Total                             13838

Relative Frequency Table

Format                Percent of Stations
Adult Contemporary                   11.2
Adult Standards                       8.6
Contemporary Hit                      4.1
Country                              14.9
News/Talk                            15.7
Oldies                                7.7
Religious                            14.6
Rock                                  6.3
Spanish Language                      5.4
Other Formats                        11.4
Total                                99.9

Example, page 8
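The relative frequency table can be rebuilt from the counts. The following is a minimal sketch, assuming the counts are typed in by hand from the frequency table; the variable names are illustrative only.

```python
# Station counts copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())

# Relative frequency (percent) = 100 * count / total, rounded to one decimal.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, pct in percents.items():
    print(f"{fmt:<20}{pct:>6}")

# Each percent is rounded separately, so the rounded values need not sum to 100.
print(f"{'Total':<20}{round(sum(percents.values()), 1):>6}")
```

Because each percent is rounded on its own, the total of the rounded column can differ slightly from 100, which is why the table shows 99.9.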


Analyzing Categorical Data

• Two-Way Tables and Marginal Distributions

When a dataset involves two categorical variables, we begin by examining the counts or percents in various categories for one of the variables.

Definition:

Two-way Table – describes two categorical variables, organizing counts according to a row variable and a column variable.

Young adults by gender and chance of getting rich

Opinion                          Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A 50-50 chance                      696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826

Example, p. 12

What are the variables described by this two-way table? How many young adults were surveyed?

Analyzing Categorical Data

• Two-Way Tables and Marginal Distributions

Definition:

The Marginal Distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.

Note: Percents are often more informative than counts, especially when comparing groups of different sizes.

To examine a marginal distribution,

1) Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals.

2) Make a graph to display the marginal distribution.
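Step 1 can be sketched in a few lines of Python. This is an illustration only: the nested dictionary is one assumed way to store the two-way table of young adults by gender and opinion, and the names are hypothetical.

```python
# Counts from the two-way table "Young adults by gender and chance of getting rich".
table = {
    "Almost no chance": {"Female": 96, "Male": 98},
    "Some chance, but probably not": {"Female": 426, "Male": 286},
    "A 50-50 chance": {"Female": 696, "Male": 720},
    "A good chance": {"Female": 663, "Male": 758},
    "Almost certain": {"Female": 486, "Male": 597},
}

# Row totals give the counts for the marginal distribution of opinion.
row_totals = {opinion: sum(cells.values()) for opinion, cells in table.items()}
grand_total = sum(row_totals.values())

# Convert the row totals to percents of the grand total.
marginal_opinion = {
    opinion: round(100 * count / grand_total, 1)
    for opinion, count in row_totals.items()
}

for opinion, pct in marginal_opinion.items():
    print(f"{opinion:<32}{pct:>5}%")
```

The same approach applied to the column totals would give the marginal distribution of gender.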

Analyzing Categorical Data

• Relationships Between Categorical Variables

• Marginal distributions tell us nothing about the relationship between two variables.

Definition:

A Conditional Distribution of a variable describes the values of that variable among individuals who have a specific value of another variable.

To examine or compare conditional distributions,

1) Select the row(s) or column(s) of interest.

2) Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s).

3) Make a graph to display the conditional distribution.

• Use a side-by-side bar graph or segmented bar graph to compare distributions.

Analyzing Categorical Data

• Two-Way Tables and Conditional Distributions

Young adults by gender and chance of getting rich

Opinion                          Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A 50-50 chance                      696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826

Example, p. 15

Calculate the conditional distribution of opinion among males. Examine the relationship between gender and opinion.

Response            Male                 Female
Almost no chance    98/2459 = 4.0%       96/2367 = 4.1%
Some chance         286/2459 = 11.6%     426/2367 = 18.0%
A 50-50 chance      720/2459 = 29.3%     696/2367 = 29.4%
A good chance       758/2459 = 30.8%     663/2367 = 28.0%
Almost certain      597/2459 = 24.3%     486/2367 = 20.5%

[Figure: bar graph of the conditional distribution of opinion among males. Title: Chance of being wealthy by age 30. Horizontal axis: Opinion. Vertical axis: Percent (0 to 40).]

[Figure: side-by-side bar graph comparing males and females. Title: Chance of being wealthy by age 30. Horizontal axis: Opinion. Vertical axis: Percent (0 to 40).]

[Figure: segmented bar graph for males and females. Title: Chance of being wealthy by age 30. Segments: Opinion. Vertical axis: Percent (0% to 100%).]
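The conditional distributions of opinion for each gender can be checked with a short calculation. This is a minimal sketch: the helper name `conditional_distribution` and the dictionary layout are assumptions made for illustration, and the counts come from the two-way table.

```python
# Column counts from the two-way table, one dictionary per gender.
counts = {
    "Male": {
        "Almost no chance": 98,
        "Some chance": 286,
        "A 50-50 chance": 720,
        "A good chance": 758,
        "Almost certain": 597,
    },
    "Female": {
        "Almost no chance": 96,
        "Some chance": 426,
        "A 50-50 chance": 696,
        "A good chance": 663,
        "Almost certain": 486,
    },
}


def conditional_distribution(column):
    """Percent of each response among the individuals in one column."""
    column_total = sum(column.values())
    return {resp: round(100 * n / column_total, 1) for resp, n in column.items()}


male = conditional_distribution(counts["Male"])
female = conditional_distribution(counts["Female"])

for response in male:
    print(f"{response:<18}{male[response]:>6}%{female[response]:>8}%")
```

Dividing by the column total (2459 or 2367) rather than the grand total is what makes these conditional rather than marginal percents.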

Displaying Quantitative Data

• Dotplots

– One of the simplest graphs to construct and interpret is a dotplot. Each data value is shown as a dot above its location on a number line.

How to Make a Dotplot

1) Draw a horizontal axis (a number line) and label it with the variable name.

2) Scale the axis from the minimum to the maximum value.

3) Mark a dot above the location on the horizontal axis corresponding to each data value.

Number of Goals Scored Per Game by the 2004 US Women’s Soccer Team

3 0 2 7 8 2 4 3 5 1 1 4 5 3 1 1 3

3 3 2 1 2 2 2 4 3 5 6 1 5 5 1 1 5
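The dotplot steps can be imitated in plain text. The sketch below is only an illustration (a text approximation, not a plotting-library call): it counts how often each value occurs in the goals data and stacks one "o" per observation above a number line.

```python
from collections import Counter

# Goals per game, copied from the two data rows.
goals = [3, 0, 2, 7, 8, 2, 4, 3, 5, 1, 1, 4, 5, 3, 1, 1, 3,
         3, 3, 2, 1, 2, 2, 2, 4, 3, 5, 6, 1, 5, 5, 1, 1, 5]

counts = Counter(goals)
axis = range(min(goals), max(goals) + 1)

# Print the tallest row first so dots stack upward from the axis.
for level in range(max(counts.values()), 0, -1):
    print(" ".join("o" if counts[value] >= level else " " for value in axis))

# The number line, scaled from the minimum to the maximum value.
print(" ".join(str(value) for value in axis))
```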

Displaying Quantitative Data

• Examining the Distribution of a Quantitative Variable

• The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?”

How to Examine the Distribution of a Quantitative Variable

In any graph, look for the overall pattern and for striking departures from that pattern.

Describe the overall pattern of a distribution by its:

• Shape

• Center

• Spread

Note individual values that fall outside the overall pattern. These departures are called outliers.

Don’t forget your SOCS!

Displaying Quantitative Data

• Describing Shape

– When you describe a distribution’s shape, concentrate on the main features. Look for rough symmetry or clear skewness.

Definitions:

A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other.

A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.

It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

[Figure: three dotplots. DiceRolls (axis 0 to 12): symmetric. Score (axis 70 to 100): skewed left. Siblings (axis 0 to 7): skewed right.]

Displaying Quantitative Data

• Stemplots (Stem-and-Leaf Plots)

– Another simple graphical display for small data sets is a stemplot. Stemplots give us a quick picture of the distribution while including the actual numerical values.

How to Make a Stemplot

1) Separate each observation into a stem (all but the final digit) and a leaf (the final digit).

2) Write all possible stems from the smallest to the largest in a vertical column and draw a vertical line to the right of the column.

3) Write each leaf in the row to the right of its stem.

4) Arrange the leaves in increasing order out from the stem.

5) Provide a key that explains in context what the stems and leaves represent.

Displaying Quantitative Data

• Splitting Stems and Back-to-Back Stemplots

– When data values are “bunched up”, we can get a better picture of the distribution by splitting stems.

– Two distributions of the same quantitative variable can be compared using a back-to-back stemplot with common stems.

Females

50 26 26 31 57 19 24 22 23 38

13 50 13 34 23 30 49 13 15 51

Males

14 7 6 5 12 38 8 7 10 10

10 11 4 5 22 7 5 10 35 7

Back-to-back stemplot with “split stems”

Females |   | Males
        | 0 | 4
        | 0 | 555677778
    333 | 1 | 0000124
     95 | 1 |
   4332 | 2 | 2
     66 | 2 |
    410 | 3 |
      8 | 3 | 58
        | 4 |
      9 | 4 |
    100 | 5 |
      7 | 5 |

Key: 4|9 represents a student who reported having 49 pairs of shoes.
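A back-to-back stemplot with split stems can be generated from the two data sets. The following sketch is one possible approach, not the method used to make the slide; the function name `split_stem_rows` is hypothetical. Each stem gets two rows: the first holds leaves 0 to 4, the second holds leaves 5 to 9.

```python
def split_stem_rows(values, low_stem=0, high_stem=5):
    """Return (stem, leaves) rows, two rows per stem: leaves 0-4, then leaves 5-9."""
    rows = []
    for stem in range(low_stem, high_stem + 1):
        for upper_half in (False, True):
            leaves = sorted(
                v % 10
                for v in values
                if v // 10 == stem and (v % 10 >= 5) == upper_half
            )
            rows.append((stem, "".join(str(leaf) for leaf in leaves)))
    return rows


# Pairs of shoes reported by each student, copied from the slide.
females = [50, 26, 26, 31, 57, 19, 24, 22, 23, 38,
           13, 50, 13, 34, 23, 30, 49, 13, 15, 51]
males = [14, 7, 6, 5, 12, 38, 8, 7, 10, 10,
         10, 11, 4, 5, 22, 7, 5, 10, 35, 7]

f_rows = split_stem_rows(females)
m_rows = split_stem_rows(males)

print(f"{'Females':>9} |   | Males")
for (stem, f_leaves), (_, m_leaves) in zip(f_rows, m_rows):
    # Female leaves are reversed so they increase outward from the stem.
    reversed_leaves = f_leaves[::-1]
    print(f"{reversed_leaves:>9} | {stem} | {m_leaves}")
```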

Displaying Quantitative Data

• Histograms

– Quantitative variables often take many values. A graph of the distribution may be clearer if nearby values are grouped together.

– The most common graph of the distribution of one quantitative variable is a histogram.

How to Make a Histogram

1) Divide the range of data into classes of equal width.

2) Find the count (frequency) or percent (relative frequency) of individuals in each class.

3) Label and scale your axes and draw the histogram. The height of the bar equals its frequency. Adjacent bars should touch, unless a class contains no individuals.

Displaying Quantitative Data

• Making a Histogram

• The table on page 35 presents data on the percent of residents from each state who were born outside of the U.S.

Example, page 35

Frequency Table

Class        Count
0 to <5         20
5 to <10        13
10 to <15        9
15 to <20        5
20 to <25        2
25 to <30        1
Total           50

[Figure: histogram. Horizontal axis: Percent of foreign-born residents (0 to 30). Vertical axis: Number of States (0 to 22).]

Displaying Quantitative Data

• Using Histograms Wisely

– Here are several cautions based on common mistakes students make when using histograms.

Cautions

1) Don’t confuse histograms and bar graphs.

2) Don’t use counts (in a frequency table) or percents (in a relative frequency table) as data.

3) Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations.

4) Just because a graph looks nice, it’s not necessarily a meaningful display of data.

Describing Quantitative Data

• Measuring Center: The Mean

– The most common measure of center is the ordinary arithmetic average, or mean.

Definition:

To find the mean $\bar{x}$ (pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:

$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation:

$$\bar{x} = \frac{\sum x_i}{n}$$

Describing Quantitative Data

• Measuring Center: The Median

– Another common measure of center is the median. In section 1.2, we learned that the median describes the midpoint of a distribution.

Definition:

The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.

To find the median of a distribution:

1) Arrange all observations from smallest to largest.

2) If the number of observations n is odd, the median M is the center observation in the ordered list.

3) If the number of observations n is even, the median M is the average of the two center observations in the ordered list.

Describing Quantitative Data

• Comparing the Mean and the Median

• The mean and median measure center in different ways, and both are useful.

– Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.

Comparing the Mean and the Median

The mean and median of a roughly symmetric distribution are close together.

If the distribution is exactly symmetric, the mean and median are exactly the same.

In a skewed distribution, the mean is usually farther out in the long tail than is the median.

Describing Quantitative Data

• Measuring Spread: The Interquartile Range (IQR)

– A measure of center alone can be misleading.

– A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range

To calculate the quartiles:

1) Arrange the observations in increasing order and locate the median M.

2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.

3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.

The interquartile range (IQR) is defined as:

IQR = Q3 – Q1

Describing Quantitative Data

• Find and Interpret the IQR

Example, page 57

Travel times to work for 20 randomly selected New Yorkers

10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

Arranged in increasing order:

5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

Q1 = 15, M = 22.5, Q3 = 42.5

IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes

Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.
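The quartile procedure can be written out directly. This is a sketch of the rule as stated in this review (the median of each half of the ordered list, leaving out the overall median when n is odd); statistical software often uses other quartile conventions and may give slightly different values. The function name `quartiles` is an illustrative choice.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from this review."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # When n is odd, the middle observation belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


# Travel times (minutes) for the 20 New Yorkers, in the order reported.
travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(travel_times)
iqr = q3 - q1
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr} minutes")
```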

Describing Quantitative Data

• Identifying Outliers

– In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.

Definition:

The 1.5 x IQR Rule for Outliers

Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.

Example, page 57

In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.

For these data, 1.5 x IQR = 1.5(27.5) = 41.25

Q1 – 1.5 x IQR = 15 – 41.25 = -26.25

Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75

Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier.

0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
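The 1.5 x IQR rule can be applied to the travel times in a few lines. This is a minimal sketch that takes Q1 and Q3 from the example rather than recomputing them; the variable names are illustrative.

```python
# Sorted travel times (minutes) and the quartiles found in the example.
travel_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                25, 30, 30, 40, 40, 45, 60, 60, 65, 85]
q1, q3 = 15, 42.5

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# An observation is flagged if it lies beyond either fence.
outliers = [t for t in travel_times if t < lower_fence or t > upper_fence]

print(f"Fences: {lower_fence} and {upper_fence}")
print(f"Outliers: {outliers}")
```

No travel time can be negative, so only the upper fence matters here; the 85-minute commute is the single value beyond it.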

Describing Quantitative Data

• The Five-Number Summary

• The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution.

• To get a quick summary of both center and spread, combine all five numbers.

Definition:

The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.

Minimum Q1 M Q3 Maximum

Describing Quantitative Data

• Boxplots (Box-and-Whisker Plots)

• The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.

How to Make a Boxplot

• Draw and label a number line that includes the range of the distribution.

• Draw a central box from Q1 to Q3.

• Note the median M inside the box.

• Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.

Describing Quantitative Data

• Measuring Spread: The Standard Deviation

– The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it!

– Consider the following data on the number of pets owned by a group of 9 children.

[Figure: dotplot of NumberOfPets, axis from 0 to 10.]

1) Calculate the mean.

$\bar{x} = 5$

2) Calculate each deviation.

deviation = observation – mean

deviation: 1 - 5 = -4

deviation: 8 - 5 = 3

Describing Quantitative Data

• Measuring Spread: The Standard Deviation

[Figure: dotplot of NumberOfPets, axis from 0 to 10.]

xi     (xi - mean)     (xi - mean)^2
1      1 - 5 = -4      (-4)^2 = 16
3      3 - 5 = -2      (-2)^2 = 4
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
5      5 - 5 = 0       (0)^2 = 0
7      7 - 5 = 2       (2)^2 = 4
8      8 - 5 = 3       (3)^2 = 9
9      9 - 5 = 4       (4)^2 = 16
       Sum = ?         Sum = ?

3) Square each deviation.

4) Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n-1). This is called the variance.

“average” squared deviation = 52/(9-1) = 6.5. This is the variance.

5) Calculate the square root of the variance. This is the standard deviation.

Standard deviation = square root of variance = $\sqrt{6.5} \approx 2.55$
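The five steps can be followed line by line in code. This is a sketch for checking the arithmetic on the pets data; the variable names are illustrative, and the built-in `statistics.stdev` is used only as a cross-check.

```python
from math import sqrt
from statistics import stdev

# Number of pets for the 9 children, read from the table of deviations.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]
n = len(pets)

# 1) Calculate the mean.
mean = sum(pets) / n

# 2) Calculate each deviation (observation - mean).
deviations = [x - mean for x in pets]

# 3) Square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# 4) Variance: sum of squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (n - 1)

# 5) Standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(f"mean = {mean}, variance = {variance}, sx = {std_dev:.2f}")
print(f"statistics.stdev gives {stdev(pets):.2f}")
```

The deviations themselves always sum to zero, which is why they are squared before averaging.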

Describing Quantitative Data

• Measuring Spread: The Standard Deviation

Definition:

The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.

$$\text{variance} = s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

$$\text{standard deviation} = s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

Describing Quantitative Data

• Choosing Measures of Center and Spread

• We now have a choice between two descriptions for center and spread:

– Mean and Standard Deviation

– Median and Interquartile Range

Choosing Measures of Center and Spread

• The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.

• Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.

• NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!


Chapter 2 Review

Describing Location in a Distribution

• Measuring Position: Percentiles

– One way to describe the location of a value in a distribution is to tell what percent of observations are less than it.

Definition:

The pth percentile of a distribution is the value with p percent of the observations less than it.

Example, p. 85

6 | 7
7 | 2334
7 | 5777899
8 | 00123334
8 | 569
9 | 03

Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?

Her score was greater than 21 of the 25 observations. Since 21 of the 25, or 84%, of the scores are below hers, Jenny is at the 84th percentile in the class’s test score distribution.
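Jenny's percentile can be verified by counting. The sketch below reads the 25 scores off the stemplot and applies the definition used here (percent of observations strictly less than the value); the function name `percentile_of` is an illustrative choice, and other percentile conventions exist.

```python
# Test scores read from the stemplot (stem = tens digit, leaf = ones digit).
scores = [67,
          72, 73, 73, 74,
          75, 77, 77, 77, 78, 79, 79,
          80, 80, 81, 82, 83, 83, 83, 84,
          85, 86, 89,
          90, 93]


def percentile_of(value, data):
    """Percent of observations in data that are strictly less than value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


jenny = percentile_of(86, scores)
print(f"Jenny is at the {jenny:.0f}th percentile")
```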

Describing Location in a Distribution

• Interpreting Cumulative Relative Frequency Graphs

Use the graph from page 88 to answer the following questions.

• Was Barack Obama, who was inaugurated at age 47, unusually young?

• Estimate and interpret the 65th percentile of the distribution.

[Figure: cumulative relative frequency graph from page 88, with marked readings at age 47 (about 11%) and at 65% (about age 58).]

Describing Location in a Distribution

• Measuring Position: z-Scores

– A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.

Definition:

If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is:

$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$

A standardized value is often called a z-score.

Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?

$$z = \frac{x - \text{mean}}{\text{standard deviation}} = \frac{86 - 80}{6.07} = 0.99$$

Describing Location in a Distribution

• Using z-scores for Comparison

We can use z-scores to compare the position of individuals in different distributions.

Example, p. 91

Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was 6.07. She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean of 76 and a standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?

$$z_{\text{stats}} = \frac{86 - 80}{6.07} = 0.99$$

$$z_{\text{chem}} = \frac{82 - 76}{4} = 1.5$$
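The comparison can be made explicit with a small helper. This is only a sketch; the function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_stats = z_score(86, mean=80, sd=6.07)
z_chem = z_score(82, mean=76, sd=4)

print(f"statistics: z = {z_stats:.2f}")
print(f"chemistry:  z = {z_chem:.2f}")

# The larger z-score marks the better performance relative to the class.
better = "chemistry" if z_chem > z_stats else "statistics"
print(f"Jenny did better, relative to her class, in {better}.")
```

Although her raw score was lower in chemistry, that score sits 1.5 standard deviations above the class mean, compared with about 0.99 in statistics.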

Describing Location in a Distribution

• Transforming Data

Transforming converts the original observations from the original units of measurement to another scale. Transformations can affect the shape, center, and spread of a distribution.

Effect of Adding (or Subtracting) a Constant

Adding the same number a (either positive, zero, or negative) to each observation:

• adds a to measures of center and location (mean, median, quartiles, percentiles), but

• does not change the shape of the distribution or measures of spread (range, IQR, standard deviation).

Example, p. 93

             n    Mean    sx     Min   Q1   M    Q3   Max   IQR   Range
Guess (m)    44   16.02   7.14   8     11   15   17   40    6     32
Error (m)    44   3.02    7.14   -5    -2   2    4    27    6     32

Describing Location in a Distribution

• Transforming Data

Effect of Multiplying (or Dividing) by a Constant

Multiplying (or dividing) each observation by the same number b (positive, negative, or zero):

• multiplies (divides) measures of center and location by b

• multiplies (divides) measures of spread by |b|, but

• does not change the shape of the distribution

Example, p. 95

             n    Mean    sx      Min     Q1      M      Q3      Max     IQR     Range
Error (ft)   44   9.91    23.43   -16.4   -6.56   6.56   13.12   88.56   19.68   104.96
Error (m)    44   3.02    7.14    -5      -2      2      4       27      6       32
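Both transformation rules can be checked numerically. The sketch below uses a small hypothetical data set (the 44 original guesses are not reproduced in this review). The shift of 13 and the factor of 3.28 are inferred from the summary tables (16.02 - 3.02 = 13, and 3.28 ft per meter), so treat them as assumptions.

```python
from statistics import mean, median, stdev

# Hypothetical guesses in meters, for illustration only.
guesses_m = [8, 11, 15, 17, 40]

# Subtracting a constant: error = guess - 13.
errors_m = [g - 13 for g in guesses_m]

# Multiplying by a constant: convert the errors from meters to feet.
errors_ft = [3.28 * e for e in errors_m]

for label, data in [("Guess (m)", guesses_m),
                    ("Error (m)", errors_m),
                    ("Error (ft)", errors_ft)]:
    print(f"{label:<11} mean={mean(data):7.2f}  median={median(data):6.2f}  "
          f"sx={stdev(data):6.2f}  range={max(data) - min(data):6.2f}")
```

Subtracting 13 moves the mean and median down by 13 but leaves the standard deviation and range unchanged, while multiplying by 3.28 scales the center and the spread alike.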

Describing Location in a Distribution

• Density Curve

Definition:

A density curve is a curve that

• is always on or above the horizontal axis, and

• has area exactly 1 underneath it.

A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.

The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.

Normal Distributions

• Normal Distributions

Definition:

A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.

• The mean of a Normal distribution is the center of the symmetric Normal curve.

• The standard deviation is the distance from the center to the change-of-curvature points on either side.

• We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).

Normal distributions are good descriptions for some distributions of real data.

Normal distributions are good approximations of the results of many kinds of chance outcomes.

Many statistical inference procedures are based on Normal distributions.

Normal Distributions

• The 68-95-99.7 Rule

Although there are many Normal curves, they all have properties in common.

Definition: The 68-95-99.7 Rule (“The Empirical Rule”)

In the Normal distribution with mean µ and standard deviation σ:

• Approximately 68% of the observations fall within σ of µ.

• Approximately 95% of the observations fall within 2σ of µ.

• Approximately 99.7% of the observations fall within 3σ of µ.

Normal Distributions

Example, p. 113

The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for 7th grade students in Gary, Indiana, is close to Normal. Suppose the distribution is N(6.84, 1.55).

a) Sketch the Normal density curve for this distribution.

b) What percent of ITBS vocabulary scores are less than 3.74?

c) What percent of the scores are between 5.29 and 9.94?
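Parts b) and c) can be answered with the 68-95-99.7 rule, because 3.74, 5.29, and 9.94 lie exactly 2σ below, 1σ below, and 2σ above the mean. The sketch below compares the rule-of-thumb answers with areas computed by Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

itbs = NormalDist(mu=6.84, sigma=1.55)

# b) 3.74 is two standard deviations below the mean.
rule_below = (100 - 95) / 2          # half of the 5% outside 2 sigma
exact_below = 100 * itbs.cdf(3.74)

# c) 5.29 is one sigma below the mean and 9.94 is two sigma above it.
rule_between = 68 / 2 + 95 / 2       # 34% below the mean plus 47.5% above
exact_between = 100 * (itbs.cdf(9.94) - itbs.cdf(5.29))

print(f"b) rule: {rule_below}%   computed: {exact_below:.1f}%")
print(f"c) rule: {rule_between}%   computed: {exact_between:.1f}%")
```

The computed areas differ slightly from the rule-based answers because 68, 95, and 99.7 are themselves rounded values.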

Normal Distributions

• The Standard Normal Distribution

– All Normal distributions are the same if we measure in units of size σ from the mean µ as center.

Definition:

The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1.

If a variable x has any Normal distribution N(µ,σ) with mean µ and standard deviation σ, then the standardized variable

$$z = \frac{x - \mu}{\sigma}$$

has the standard Normal distribution, N(0,1).

Normal Distributions

• The Standard Normal Table

Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.

Definition: The Standard Normal Table

Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.

Suppose we want to find the proportion of observations from the standard Normal distribution that are less than 0.81. We can use Table A:

Z      .00     .01     .02
0.7    .7580   .7611   .7642
0.8    .7881   .7910   .7939
0.9    .8159   .8186   .8212

P(z < 0.81) = .7910
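The Table A excerpt can be reproduced with the standard Normal cumulative distribution function. This is a sketch using Python's `statistics.NormalDist`, shown as a way to check table lookups rather than as a replacement for learning to read the table.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area to the left of z = 0.81, the value looked up in Table A.
area = standard_normal.cdf(0.81)
print(f"P(z < 0.81) = {area:.4f}")

# Rebuild the three rows of the table excerpt: row label plus column offset.
print("Z      .00     .01     .02")
for row in (0.7, 0.8, 0.9):
    entries = [standard_normal.cdf(row + col) for col in (0.00, 0.01, 0.02)]
    print(f"{row:.1f}    " + "   ".join(f"{e:.4f}" for e in entries))
```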

Normal Distributions

• Normal Distribution Calculations

When Tiger Woods hits his driver, the distance the ball travels can be described by N(304, 8). What percent of Tiger’s drives travel between 305 and 325 yards?

When x = 305, $z = \frac{305 - 304}{8} = 0.13$

When x = 325, $z = \frac{325 - 304}{8} = 2.63$

Using Table A, we can find the area to the left of z = 2.63 and the area to the left of z = 0.13.

0.9957 – 0.5517 = 0.4440. About 44% of Tiger’s drives travel between 305 and 325 yards.
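The same calculation can be done without the table. The sketch below works it two ways: with the z-values rounded to two decimals as on the slide, and directly on the N(304, 8) distribution with no rounding. The variable names are illustrative.

```python
from statistics import NormalDist

# Table-style calculation, using the rounded z-values from the slide.
standard_normal = NormalDist()
table_style = standard_normal.cdf(2.63) - standard_normal.cdf(0.13)

# Direct calculation on the distribution of driving distances (yards).
drives = NormalDist(mu=304, sigma=8)
direct = drives.cdf(325) - drives.cdf(305)

print(f"Using rounded z-values: {table_style:.4f}")
print(f"Without rounding z:     {direct:.4f}")
```

The two answers differ a little in the third decimal place because Table A requires rounding z to two decimals, but both round to about 44% or 45% of drives.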

Normal Distributions

• Assessing Normality

• The Normal distributions provide good models for some distributions of real data. Many statistical inference procedures are based on the assumption that the population is approximately Normally distributed. Consequently, we need a strategy for assessing Normality.

Plot the data.

• Make a dotplot, stemplot, or histogram and see if the graph is approximately symmetric and bell-shaped.

Check whether the data follow the 68-95-99.7 rule.

• Count how many observations fall within one, two, and three standard deviations of the mean and check to see if these percents are close to the 68%, 95%, and 99.7% targets for a Normal distribution.

Normal Distributions

• Normal Probability Plots

• Most software packages can construct Normal probability plots. These plots are constructed by plotting each observation in a data set against its corresponding percentile’s z-score.

Interpreting Normal Probability Plots

If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.


Chapter 3 Review

Scatterplots and Correlation

• Explanatory and Response Variables

Most statistical studies examine data on more than one variable. In many of these settings, the two variables play different roles.

Definition:

A response variable measures an outcome of a study. An explanatory variable may help explain or influence changes in a response variable.

Note: In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don’t involve direct causation.


Scatterplots and Correlation

• Displaying Relationships: Scatterplots

The most useful graph for displaying the relationship between two quantitative variables is a scatterplot.

Definition:

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point on the graph.

How to Make a Scatterplot

1. Decide which variable should go on each axis.

• Remember, the eXplanatory variable goes on the X-axis!

2. Label and scale your axes.

3. Plot individual data values.


Scatterplots and Correlation

• Interpreting Scatterplots

To interpret a scatterplot, follow the basic strategy of data analysis from Chapters 1 and 2. Look for patterns and important departures from those patterns.

How to Examine a Scatterplot

As in any graph of data, look for the overall pattern and for striking departures from that pattern.

• You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.

• An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

• Remember DOFS: Direction, Outliers, Form, Strength.


Scatterplots and Correlation

• Measuring Linear Association: Correlation

A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables.

Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is.

Definition:

The correlation r measures the direction and strength of the linear relationship between two quantitative variables.

• r is always a number between -1 and 1

• r > 0 indicates a positive association.

• r < 0 indicates a negative association.

• Values of r near 0 indicate a very weak linear relationship.

• The strength of the linear relationship increases as r moves away from 0 towards -1 or 1.

• The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
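The slides do not show the formula for r. As a sketch, the usual textbook calculation averages the products of standardized x and y values, dividing by n - 1. The function name `correlation` and the data below are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 57, 71, 74, 80]
print(round(correlation(hours, score), 3))

# A perfect linear relationship gives r = 1 or r = -1 (up to rounding).
print(correlation([1, 2, 3], [2, 4, 6]))
print(correlation([1, 2, 3], [6, 4, 2]))
```

Because the calculation uses standardized values, swapping the roles of x and y, or changing the units of either variable, leaves r unchanged.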


Scatterplots and Correlation

• Facts about Correlation

How correlation behaves is more important than the details of the formula. Here are some important facts about r.

1. Correlation makes no distinction between explanatory and response variables.

2. r does not change when we change the units of measurement of x, y, or both.

3. The correlation r itself has no unit of measurement.

Cautions:

• Correlation requires that both variables be quantitative.

• Correlation does not describe curved relationships between variables, no matter how strong the relationship is.

• Correlation is not resistant. r is strongly affected by a few outlying observations.

• Correlation is not a complete summary of two-variable data.


Least-Squares Regression

• Interpreting a Regression Line

A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x.

Definition:

Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form

ŷ = a + bx

In this equation,

• ŷ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x.

• b is the slope, the amount by which y is predicted to change when x increases by one unit.

• a is the y intercept, the predicted value of y when x = 0.


Least-Squares Regression

• Interpreting a Regression Line

Consider the regression line from the example “Does Fidgeting Keep You Slim?” Identify the slope and y-intercept and interpret each value in context.

The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats.

The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA.

fatgain = 3.505 - 0.00344(NEA change)


Least-Squares Regression

• Prediction

We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x.

Use the NEA and fat gain regression line to predict the fat gain for a person whose NEA increases by 400 cal when she overeats.

fatgain = 3.505 - 0.00344(NEA change)

fatgain = 3.505 - 0.00344(400)

fatgain = 2.13

We predict a fat gain of 2.13 kg for a person whose NEA increases by 400 calories.
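The same substitution can be written as a small Python function. The coefficients come from the regression line on the slide; the function name `predicted_fat_gain` is made up for this sketch.

```python
def predicted_fat_gain(nea_change):
    """Predicted fat gain (kg) from the slide's line:
    fat gain = 3.505 - 0.00344 * (NEA change in calories)."""
    return 3.505 - 0.00344 * nea_change

print(round(predicted_fat_gain(400), 2))  # 2.13
```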


Least-Squares Regression

• Extrapolation

We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line.

While we can substitute any value of x into the equation of the regression line, we must exercise caution in making predictions outside the observed values of x.

Definition:

Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.

Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.


Least-Squares Regression

• Residuals

In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.

Definition:

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

residual = observed y – predicted y

residual = y - ŷ

[Figure: scatterplot with regression line. Points above the line have positive residuals; points below the line have negative residuals.]
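As a quick sketch of residual = y - ŷ, the function below computes residuals for any line ŷ = a + bx. The name `residuals` and the three observations are hypothetical; only the line's coefficients come from the fidgeting example.

```python
def residuals(xs, ys, a, b):
    """Residual for each point: observed y minus predicted y (a + b*x)."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical observations paired with the line fat gain = 3.505 - 0.00344 * NEA.
nea = [100, 300, 500]
fat = [3.4, 2.2, 1.9]
for x, res in zip(nea, residuals(nea, fat, a=3.505, b=-0.00344)):
    side = "above" if res > 0 else "below"
    print(f"NEA = {x}: residual = {res:+.3f} ({side} the line)")
```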


Least-Squares Regression

• Least-Squares Regression Line

We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation.

Definition: Equation of the least-squares regression line

We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means and standard deviations of the two variables and their correlation. The least squares regression line is the line ŷ = a + bx with

slope

$$b = r \frac{s_y}{s_x}$$

and y intercept

$$a = \bar{y} - b\bar{x}$$
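The definition gives the slope as b = r(s_y / s_x) and the y intercept as a = ȳ - b x̄. A minimal Python sketch of that calculation follows; the function name `least_squares_line` and the data are made up for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r * s_y / s_x
    and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation r from the means and standard deviations.
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, for illustration only.
a, b = least_squares_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

Because a = ȳ - b x̄, the fitted line always passes through the point (x̄, ȳ).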


Least-Squares Regression

• Interpreting Residual Plots

A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.

1) The residual plot should show no obvious patterns.

2) The residuals should be relatively small in size.

Definition:

If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$

[Figure: residual plot. A pattern in the residuals indicates that a linear model is not appropriate.]
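The standard deviation of the residuals is the square root of the sum of squared residuals divided by n - 2. A short sketch, with a hypothetical function name and made-up data:

```python
from math import sqrt

def residual_std_dev(xs, ys, a, b):
    """s = sqrt( sum of (y - y_hat)^2 / (n - 2) ) for the line y_hat = a + b*x."""
    squared = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / (len(xs) - 2))

# Hypothetical data and line, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(residual_std_dev(xs, ys, a=0.05, b=1.99), 3))
```

The value of s is in the units of the response variable and describes the typical size of a prediction error.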


Least-Squares Regression

• Interpreting Computer Regression Output

A number of statistical software packages produce similar regression output. Be sure you can locate

• the slope b,

• the y intercept a,

• and the values of s and r².


Least-Squares Regression

• Correlation and Regression Wisdom

2. Correlation and regression lines describe only linear relationships.

3. Correlation and least-squares regression lines are not resistant.

Definition:

An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.

An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.


Chapter 4 Review


Sampling and Surveys

• Population and Sample

The distinction between population and sample is basic to statistics. To make sense of any sample result, you must know what population the sample represents.

Definition:

The population in a statistical study is the entire group of individuals about which we want information.

A sample is the part of the population from which we actually collect information. We use information from a sample to draw conclusions about the entire population.

[Diagram: Population and Sample. Collect data from a representative sample, then make an inference about the population.]


Sampling and Surveys

• The Idea of a Sample Survey

We often draw conclusions about a whole population on the basis of a sample. Choosing a sample from a large, varied population is not that easy.

A “sample survey” is a study that uses an organized plan to choose a sample that represents some specific population.

Step 1: Define the population we want to describe.

Step 2: Say exactly what we want to measure.

Step 3: Decide how to choose a sample from the population.


Sampling and Surveys

• How to Sample Badly

How can we choose a sample that we can trust to represent the population? There are a number of different methods to select samples.

Definition: Choosing individuals who are easiest to reach results in a convenience sample.

Definition: The design of a statistical study shows bias if it systematically favors certain outcomes.

Convenience samples often produce unrepresentative data. Why?


Sampling and Surveys

• How to Sample Badly

• Convenience samples are almost guaranteed to show bias. So are voluntary response samples, in which people decide whether to join the sample in response to an open invitation.

Definition: A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples show bias because people with strong opinions (often in the same direction) are most likely to respond.


Sampling and Surveys

• How to Sample Well: Random Sampling

• The statistician’s remedy is to allow impersonal chance to choose the sample. A sample chosen by chance rules out both favoritism by the sampler and self-selection by respondents.

• Random sampling, the use of chance to select a sample, is the central principle of statistical sampling.

Definition: A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

In practice, people use random numbers generated by a computer or calculator to choose samples. If you don’t have technology handy, you can use a table of random digits.
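As one illustration of letting a computer choose the sample, Python's standard `random` module can draw an SRS without replacement. The function name `simple_random_sample`, the roster, and the seed are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Choose n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical roster of 30 students labeled 1 through 30.
roster = list(range(1, 31))
print(simple_random_sample(roster, 5, seed=2024))
```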


Sampling and Surveys

• How to Choose an SRS

How to Choose an SRS Using Table D

Step 1: Label. Give each member of the population a numerical label of the same length.

Step 2: Table. Read consecutive groups of digits of the appropriate length from Table D.

Your sample contains the individuals whose labels you find.

Definition: A table of random digits is a long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties:

• Each entry in the table is equally likely to be any of the 10 digits 0 - 9.

• The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part.
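The label-and-table procedure can be mimicked in code. The sketch below assumes labels 1 through N written with equal length (01, 02, ...), and it skips groups that match no label or repeat an earlier choice; that skipping rule is the usual practice but is an assumption here, since the slide does not state it. The digit string is made up and is not taken from Table D.

```python
def srs_from_digits(digits, population_size, n):
    """Read consecutive equal-length groups of digits and keep the first n
    distinct labels between 1 and population_size."""
    width = len(str(population_size))  # e.g. 2 digits for labels 01-30
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        # Assumption: ignore unused labels and repeats.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            break
    return chosen

# Made-up digit string, read as 69 38 18 27 15 00 42 61 09.
print(srs_from_digits("693818271500426109", population_size=30, n=4))
# [18, 27, 15, 9]
```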


Sampling and Surveys

• Other Sampling Methods

• The basic idea of sampling is straightforward: take an SRS from the population and use your sample results to gain information about the population. Sometimes there are statistical advantages to using more complex sampling methods.

• One common alternative to an SRS involves sampling important groups (called strata) within the population separately. These “sub-samples” are combined to form one stratified random sample.

Definition:

To select a stratified random sample, first classify the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
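A stratified random sample can be sketched as one SRS per stratum, combined at the end. The function name, the strata, and the per-stratum sample sizes below are all hypothetical.

```python
import random

def stratified_random_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and combine them.

    strata: dict mapping stratum name -> list of individuals
    sizes:  dict mapping stratum name -> number to sample from that stratum
    """
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))
    return sample

# Hypothetical school with two strata.
strata = {
    "juniors": [f"J{i}" for i in range(1, 41)],
    "seniors": [f"S{i}" for i in range(1, 61)],
}
print(stratified_random_sample(strata, {"juniors": 4, "seniors": 6}, seed=7))
```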


Sampling and Surveys

• Other Sampling Methods

• Although a stratified random sample can sometimes give more precise information about a population than an SRS, both sampling methods are hard to use when populations are large and spread out over a wide area.

• In that situation, we’d prefer a method that selects groups of individuals that are “near” one another.

Definition: To take a cluster sample, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample.
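For contrast with stratified sampling, a cluster sample takes an SRS of whole clusters and keeps everyone in the chosen clusters. The sketch below uses a hypothetical function name and made-up homeroom clusters.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Choose an SRS of clusters; include every individual in each chosen cluster.

    clusters: dict mapping cluster name -> list of individuals
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical homerooms serving as clusters.
homerooms = {
    "Room 101": ["Ana", "Ben", "Cal"],
    "Room 102": ["Dee", "Eli", "Fay"],
    "Room 103": ["Gus", "Hal", "Ivy"],
    "Room 104": ["Jo", "Kai", "Lee"],
}
print(cluster_sample(homerooms, 2, seed=11))
```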


Experiments

• Observational Study versus Experiment

In contrast to observational studies, experiments don’t just observe individuals or ask them questions. They actively impose some treatment in order to measure the response.

Definition:

An observational study observes individuals and measures variables of interest but does not attempt to influence the responses.

An experiment deliberately imposes some treatment on individuals to measure their responses.

When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

The distinction between observational study and experiment is one of the most important in statistics.


Experiments

• The Language of Experiments

An experiment is a statistical study in which we actually do something (a treatment) to people, animals, or objects (the experimental units) to observe the response. Here is the basic vocabulary of experiments.

Definition: A specific condition applied to the individuals in an experiment is called a treatment. If an experiment has several explanatory variables, a treatment is a combination of specific values of these variables.

The experimental units are the smallest collection of individuals to which treatments are applied. When the units are human beings, they often are called subjects.

Sometimes, the explanatory variables in an experiment are called factors. Many experiments study the joint effects of several factors. In such an experiment, each treatment is formed by combining a specific value (often called a level) of each of the factors.
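To make "each treatment combines one level of each factor" concrete, the sketch below lists every combination with `itertools.product`. The factors and levels are invented for illustration.

```python
from itertools import product

def treatments(factors):
    """List every treatment: one level chosen from each factor."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

# Hypothetical experiment with two factors (2 levels x 3 levels = 6 treatments).
factors = {
    "fertilizer": ["low", "high"],
    "watering": ["daily", "weekly", "monthly"],
}
for t in treatments(factors):
    print(t)
```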


Experiments

• How to Experiment Well: The Randomized Comparative Experiment

• The remedy for confounding is to perform a comparative experiment in which some units receive one treatment and similar units receive another. Most well-designed experiments compare two or more treatments.

• Comparison alone isn’t enough. If the treatments are given to groups that differ greatly, bias will result. The solution to the problem of bias is random assignment.

Definition: In an experiment, random assignment means that experimental units are assigned to treatments at random, that is, using some sort of chance process.


Experiments

• The Randomized Comparative Experiment

Definition: In a completely randomized design, the treatments are assigned to all the experimental units completely by chance.

Some experiments may include a control group that receives an inactive treatment or an existing baseline treatment.

[Diagram: Experimental units → random assignment → Group 1 receives Treatment 1 and Group 2 receives Treatment 2 → compare results.]
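A completely randomized design can be sketched by shuffling the units and dealing them out to the treatments in turn. The function name and the subject labels are hypothetical, and forcing equal group sizes is a choice made for this sketch rather than a requirement of the design.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Shuffle the units, then deal them to the treatments in turn so the
    groups are as equal in size as possible."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical subjects 1-10 and two treatments.
groups = completely_randomized_design(range(1, 11), ["Treatment 1", "Treatment 2"], seed=5)
for name, members in groups.items():
    print(name, sorted(members))
```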


Experiments

• Three Principles of Experimental Design

• Randomized comparative experiments are designed to give good evidence that differences in the treatments actually cause the differences we see in the response.

Principles of Experimental Design

1. Control for lurking variables that might affect the response: Use a comparative design and ensure that the only systematic difference between the groups is the treatment administered.

2. Random assignment: Use impersonal chance to assign experimental units to treatments. This helps create roughly equivalent groups of experimental units by balancing the effects of lurking variables that aren’t controlled on the treatment groups.

3. Replication: Use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups.


Experiments

• Experiments: What Can Go Wrong?

• The logic of a randomized comparative experiment depends on our ability to treat all the subjects the same in every way except for the actual treatments being compared.

• Good experiments, therefore, require careful attention to details to ensure that all subjects really are treated identically.

A response to a dummy treatment is called a placebo effect. The strength of the placebo effect is a strong argument for randomized comparative experiments.

Whenever possible, experiments with human subjects should be double-blind.

Definition: In a double-blind experiment, neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received.


Experiments

• Inference for Experiments

• In an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation.

• We can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see if only chance were operating.

• If they are, we call them statistically significant.

Definition: An observed effect so large that it would rarely occur by chance is called statistically significant.

A statistically significant association in data from a well-designed experiment does imply causation.
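The slides do not say how to judge whether an effect would "rarely occur by chance." One common approach, shown here only as a sketch, is a randomization (permutation) test: reshuffle the responses between the groups many times and count how often the reshuffled difference in means is at least as large as the observed one. The function name and the response values are hypothetical.

```python
import random

def randomization_p_value(group1, group2, trials=10_000, seed=None):
    """Estimate how often chance alone (random regrouping of the responses)
    gives a difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    n1, n2 = len(group1), len(group2)
    observed = abs(sum(group1) / n1 - sum(group2) / n2)
    pooled = list(group1) + list(group2)
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # mimic a fresh random assignment
        diff = abs(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / trials

# Hypothetical responses from two treatment groups.
treatment = [23, 27, 31, 29, 35, 30]
control = [20, 22, 25, 19, 24, 26]
print(randomization_p_value(treatment, control, seed=1))
```

A small proportion suggests the observed difference would rarely arise from the random assignment alone.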


Experiments

• Blocking

• Completely randomized designs are the simplest statistical designs for experiments. But just as with sampling, there are times when the simplest method doesn’t yield the most precise results.

Definition

A block is a group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.

In a randomized block design, the random assignment of experimental units to treatments is carried out separately within each block.

Form blocks based on the most important unavoidable sources of variability (lurking variables) among the experimental units.

Randomization will average out the effects of the remaining lurking variables and allow an unbiased comparison of the treatments.

Control what you can, block on what you can’t control, and randomize to create comparable groups.
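In a randomized block design, the random assignment is repeated separately inside each block. The sketch below groups units by a blocking variable and then shuffles within each block; the function name, the units, and the choice of block are hypothetical.

```python
import random

def randomized_block_design(units, block_of, treatments, seed=None):
    """Assign treatments at random separately within each block.

    block_of: function that returns the block a unit belongs to.
    Returns a dict mapping each unit to its assigned treatment.
    """
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # chance decides the order within the block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical units: (block, id) pairs, blocked by the first entry.
units = [("men", i) for i in range(1, 5)] + [("women", i) for i in range(1, 5)]
plan = randomized_block_design(units, lambda u: u[0], ["A", "B"], seed=3)
for unit, treatment in sorted(plan.items()):
    print(unit, treatment)
```

Each block ends up with its units split evenly between the treatments, so the blocking variable cannot be confounded with the treatment.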


Experiments

• Matched-Pairs Design

• A common type of randomized block design for comparing two treatments is a matched-pairs design. The idea is to create blocks by matching pairs of similar experimental units.

Definition

A matched-pairs design is a randomized blocked experiment in which each block consists of a matching pair of similar experimental units.

Chance is used to determine which unit in each pair gets each treatment.

Sometimes, a “pair” in a matched-pairs design consists of a single unit that receives both treatments. Since the order of the treatments can influence the response, chance is used to determine which treatment is applied first for each unit.


Using Studies Wisely

• Scope of Inference

What type of inference can be made from a particular study? The answer depends on the design of the study.

Well-designed experiments randomly assign individuals to treatment groups. However, most experiments don’t select experimental units at random from the larger population. That limits such experiments to inference about cause and effect.

Observational studies don’t randomly assign individuals to groups, which rules out inference about cause and effect. Observational studies that use random sampling can make inferences about the population.


Using Studies Wisely

• Data Ethics

Complex issues of data ethics arise when we collect data from people. Here are some basic standards of data ethics that must be obeyed by all studies that gather data from human subjects, both observational studies and experiments.

Basic Data Ethics

• All planned studies must be reviewed in advance by an institutional review board charged with protecting the safety and well-being of the subjects.

• All individuals who are subjects in a study must give their informed consent before data are collected.

• All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.