STA220 H1F Section L0301: Health & Life Sciences Test October … · 2014-10-23 · 1,2 3,4 5 6...



STA220 H1F Section L0301: Health & Life Sciences

Test October 23, 2014

LAST NAME: SOLUTIONS FIRST NAME:

STUDENT NUMBER:

INSTRUCTIONS:

• Time: 90 minutes

• Aids allowed: calculator, one-sided 8½ × 11 handwritten aid sheet

• Complete all questions in pen. Any questions completed in pencil will not be eligible to be remarked if there is a marking error.

• A table of values from the standard Normal distribution is on the two last pages (pages 14 and 15).

• Total points: 70

1,2 3,4 5 6 7abc 7de

8ab 8cd 9 10 11abc 11de


1. (5 marks) True or False:

(a) T If a data value has a z-score of -1.8 then it must be below the mean.

(b) F A boxplot is useful to show if a distribution is bimodal.

(c) F If the sample mean is 0, then the standard deviation will also be 0.

(d) F One thousand students are randomly selected from all those currently registered at the University of Toronto and they each report how many kilometres they rode on the subway during the past week. Since many students did not ride the subway at all, a histogram of the values does not look like the shape of a normal density function. True or False: Since the shape of the histogram is not like the shape of a normal density function, we should not use the normal distribution for the sampling distribution of the average number of kilometres that UofT students rode on the subway last week.

(e) F A researcher carries out a survey to determine the proportion of students who stayed up all night at least one night during the last exam period. She calculates both a 95% confidence interval and a 99% confidence interval for this proportion from the resulting data. True or False: The 95% confidence interval will be wider than the 99% confidence interval.

2. (3 marks) The plot below shows life expectancy on the vertical axis and per capita cigarette consumption (the average number of cigarettes consumed in a year per person) on the horizontal axis for a simple random sample of 10 countries.

[Scatterplot: life expectancy (vertical axis, ticks at 40 to 70) against per capita cigarette consumption (horizontal axis, ticks at 500 to 3000).]

Which of the following are valid conclusions? Circle all that are valid.

• There is a negative correlation between life expectancy and cigarette consumption.

• The sample is too small to draw any general conclusions.

• Per capita income (the average income per person in a country) is a likely lurking variable. (Only valid conclusion.)


3. (2 marks) Below is a probability density function for a continuous random variable X.

[Plot of the density function: X on the horizontal axis (ticks at 40 to 100), density on the vertical axis (ticks at 0.005 to 0.030).]

Explain how you would use this density function to calculate the following:

(a) P(X = 60)

0

(b) P(X ≤ 50)

The area under the density function to the left of 50.

4. (4 marks) In this course so far, we’ve discussed these probability distributions:

• Bernoulli
• Binomial
• Discrete Uniform
• Continuous Uniform
• Normal

Which of these probability distributions would be appropriate in measuring the variable of interest in each of the following situations:

(a) the average weight at birth of the next 40 newborns in a hospital

Probability distribution: Normal

(b) the number of people in a random sample of size 10 from our class who have a birthday in October (assuming that your birthdays are all independent)

Probability distribution: Binomial

(c) the proportion of University of Toronto students in a random sample of size 200 whose birthdays are in October (assuming that the birthdays of all UofT students are independent)

Probability distribution: Normal

(d) the birthday of a randomly selected person

Probability distribution: Discrete uniform


5. Suppose there are 100 hospitals in Ontario that perform appendectomies, a common surgery. Each of these hospitals performs several appendectomies per year. You are interested in estimating the average cost of an appendectomy in Ontario from a sample of appendectomies performed in Ontario last year.

(a) (2 marks) Describe how you would use cluster sampling to choose the appendectomies in your sample.

Randomly choose a sample of hospitals and take every appendectomy performed last year in those hospitals to make up your sample of appendectomies.

(b) (2 marks) Describe how you would use stratified sampling to choose the appendectomies in your sample.

Within every hospital, randomly select a sample of appendectomies to make up your sample.

(c) (2 marks) Would cluster sampling or stratified sampling give a better estimate of the average cost of an appendectomy in Ontario? Why?

Stratified sampling would give a better estimate because we expect that there is more homogeneity in cost of an appendectomy within hospitals than between hospitals.


6. A researcher is interested in studying the effect of a diet high in saccharin (an artificial sweetener) on health outcomes in rats. He randomly assigns 30 rats to receive a diet high in saccharin and 10 rats to receive a regular diet. One of the outcomes of interest is whether or not the rats develop cancer. The table below shows the status of each rat at the end of the study.

                     Cancer
                  Yes    No    Total
Saccharin diet      7    23       30
Regular diet        1     9       10
Total               8    32       40

(a) (1 mark) Which of the following is the conditional distribution of cancer given that diet is high in saccharin?

• 0.233, 0.100
• 0.233, 0.767
• 0.875, 0.125
• 0.175, 0.575
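A conditional distribution divides each count in a row by that row's total. The short Python sketch below applies this to the saccharin row of the table; the dictionary layout and the function name `conditional_distribution` are illustrative choices, not part of the test.

```python
# Counts from the table in Question 6 (rows: diet, columns: cancer status).
counts = {
    "saccharin": {"yes": 7, "no": 23},
    "regular": {"yes": 1, "no": 9},
}


def conditional_distribution(row):
    """Return the distribution of cancer status within one diet group."""
    total = sum(row.values())
    return {status: n / total for status, n in row.items()}


saccharin_dist = conditional_distribution(counts["saccharin"])
print({status: round(p, 3) for status, p in saccharin_dist.items()})
# {'yes': 0.233, 'no': 0.767}
```

The two proportions sum to 1, which is a quick check that they form a distribution over the cancer outcomes.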

(b) (2 marks) The rats were randomly assigned to receive the diet high in saccharin or the regular diet. What is the purpose of randomization in this study?

The purpose is to avoid bias in which rat gets which treatment so that we are able to conclude that any difference in the rate of cancer between the two treatment groups was caused by the different diets.

(c) (2 marks) Suppose the researcher would like to study another 40 rats. His lab is not large enough to accommodate them, so he recruits another researcher with a lab in another location to also study 40 rats for a total of 80 rats. Each researcher randomly assigns 30 rats in his lab to receive a diet high in saccharin and 10 rats to receive a regular diet. In this example, the labs are: (Circle all that apply.)

• a lurking variable

• a confounding variable

• a common cause

• a blocking variable (Only correct answer.)

• the explanatory variable

• the outcome variable


7. Weight is another outcome of interest in the study of the effect of a diet high in saccharin on health outcomes in rats. Weights of the types of rats being studied, before undergoing the dietary treatment, follow a normal distribution with mean 0.5 kg and standard deviation 0.1 kg.

(a) (3 marks) What percentage of the rats being studied weigh more than 0.6 kg before the start of the study?

Let W be the weight of a randomly chosen rat before the study.

$$P(W > 0.6) = P\left(Z > \frac{0.6 - 0.5}{0.1}\right) \quad \text{where } Z \sim N(0, 1)$$

$$= P(Z > 1)$$

$$= 1 - 0.8413$$

$$= 0.1587 \text{ or } 15.9\%$$
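The table lookup in part (a) can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Weight distribution stated in Question 7: mean 0.5 kg, sd 0.1 kg.
weight = NormalDist(mu=0.5, sigma=0.1)

# P(W > 0.6) is the complement of the cumulative probability at 0.6.
p_over = 1 - weight.cdf(0.6)
print(round(p_over, 4))  # 0.1587
```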

(b) (3 marks) What is the probability distribution for the average of the weights at the start of the study of the 10 rats chosen to receive a regular diet? Is it exact or approximate?

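$N\left(0.5, \frac{(0.1)^2}{10}\right)$ (in terms of the variance) or $N\left(0.5, \frac{0.1}{\sqrt{10}}\right)$ (in terms of the standard deviation)

It is exact, since the individual weights are themselves normally distributed.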
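As a rough numerical companion to part (b), the sketch below builds the distribution of the sample mean by dividing the population standard deviation by the square root of the sample size. The names `mu`, `sigma`, `n`, and `mean_weight` are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 0.5, 0.1, 10

# The mean of n independent N(mu, sigma^2) weights is N(mu, sigma^2 / n).
mean_weight = NormalDist(mu, sigma / math.sqrt(n))

print(round(mean_weight.stdev, 4))     # 0.0316
print(round(mean_weight.variance, 4))  # 0.001
```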

(c) (2 marks) At the start of the study, a collar weighing 0.04 kg is put on each rat and the rats are re-weighed. What are the mean and standard deviation of the distribution of the weights of the rats including the collar?

Mean = 0.5 + 0.04 = 0.54 kg
Standard deviation = 0.1 kg (unchanged)


(d) (3 marks) Two rats are randomly chosen to undergo 2 different treatments and the researcher is interested in the difference in their weights. What are the mean and standard deviation of the difference in the weights of two randomly chosen rats before the rats undergo the treatment?

Using the facts that E(X1 − X2) = E(X1) − E(X2) and Var(X1 − X2) = Var(X1) + Var(X2) if X1 and X2 are independent:
Mean = 0 kg
Standard deviation = $\sqrt{0.1^2 + 0.1^2} = 0.14$ kg
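The variance rule used in part (d) is also what `statistics.NormalDist` applies when two instances are subtracted: it treats them as independent normal variables. A small sketch, with illustrative names `rat1`, `rat2`, and `diff`:

```python
from statistics import NormalDist

# Two independently chosen rats, each with the weight distribution from Question 7.
rat1 = NormalDist(0.5, 0.1)
rat2 = NormalDist(0.5, 0.1)

# Subtraction assumes independence: means subtract, variances add.
diff = rat1 - rat2
print(diff.mean, round(diff.stdev, 2))  # 0.0 0.14
```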

(e) (4 marks) Rats that are too small cannot be used in the study, so the study protocol calls for the removal of rats that are in the lowest 10% of the distribution of weights. What is the cut-off weight for a rat to be removed from the study?

Let x be the cut-off weight. We want to find the value of x such that

$$P(W \le x) = 0.1$$

That is,

$$P\left(Z \le \frac{x - 0.5}{0.1}\right) = 0.1$$

From the standard normal table, $\frac{x - 0.5}{0.1} = -1.28$, giving x = 0.372 kg
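The cut-off in part (e) is the 10th percentile of the weight distribution, so it can be checked with an inverse CDF instead of a table. A minimal sketch using `statistics.NormalDist.inv_cdf`; the variable names are illustrative.

```python
from statistics import NormalDist

weight = NormalDist(mu=0.5, sigma=0.1)

# The weight below which the lowest 10% of rats fall.
cutoff = weight.inv_cdf(0.10)
print(round(cutoff, 3))  # 0.372
```

The software value uses a more precise z than the table's −1.28, but the two agree to three decimal places.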


8. The modified boxplots below show life expectancy for countries in Asia and South America.

[Modified boxplots: life expectancy (vertical axis, ticks at 40 to 70) for Asia and South America.]

(a) (2 marks) Is life expectancy more variable in Asia or South America? How do you know?

Asia. It has a larger range and larger IQR (the box is longer).

(b) (1 mark) Approximately what percentage of countries in Asia have life expectancies that are lower than the minimum life expectancy in South America?

About 25% since the minimum in South America is close to the first quartile in Asia.


(c) (2 marks) For the boxplot for Asia, the upper whisker is longer than the lower whisker. Explain why this happened.

In a box plot, a whisker extends to the most extreme data value that is still within 1.5 IQR of the quartile. For the lower whisker for Asia, this data value is close to the first quartile. (The outlier for Asia lies beyond Q1 − 1.5 IQR.)

(d) A new measure of centre is the Gibbs-midpoint. The Gibbs-midpoint is defined as the average of the maximum and minimum.

i. (1 mark) Give an approximate value for the Gibbs-midpoint for life expectancies in Asia.

$\frac{75 + 35}{2} = 55$

ii. (2 marks) Is the Gibbs-midpoint a robust statistic? Why or why not?

It is not robust. Removing an outlier (such as we see in Asia) changes its value by a large amount.
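The sensitivity described above can be illustrated with a small Python sketch. The data values below are hypothetical: the test does not list the life expectancies behind the boxplot, so the list only mimics a minimum near 35 and a maximum near 75. The function name `gibbs_midpoint` is also illustrative.

```python
from statistics import median


def gibbs_midpoint(values):
    """Average of the maximum and minimum, as defined in Question 8(d)."""
    return (max(values) + min(values)) / 2


# Hypothetical life expectancies with one low outlier (35); not the real data.
data = [35, 58, 60, 62, 65, 68, 70, 75]
without_outlier = [v for v in data if v != 35]

print(gibbs_midpoint(data), gibbs_midpoint(without_outlier))  # 55.0 66.5
print(median(data), median(without_outlier))                  # 63.5 65
```

In this made-up example, dropping one value moves the Gibbs-midpoint by 11.5 years while the median moves by 1.5, which is the contrast between a non-robust and a robust statistic.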


9. Suppose that in a certain population of married couples 30% of the husbands smoke and 20% of the wives smoke. And for couples where the wife smokes, 50% of the husbands smoke. Find each of the following. If you are not given enough information, indicate what you would need to know in order to find the answer.

(a) (2 marks) The probability that both the husband and wife smoke

Let H be the event that the husband smokes and W be the event that the wife smokes.

P(H and W) = P(H|W) P(W) = 0.5 × 0.2 = 0.1

(b) (2 marks) The probability that neither the wife nor the husband smokes

P(neither smokes) = 1 − P(H or W)
= 1 − (0.3 + 0.2 − 0.1)
= 0.6

(c) (2 marks) Is whether the husband smokes mutually exclusive of whether his wife smokes? How do you know?

No. Since P(H|W) = 0.5 > 0 (or P(H and W) = 0.1 > 0), it is possible that both the husband and wife smoke.
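The three parts of Question 9 chain together the multiplication rule, the addition rule, and the complement rule. A short sketch of that arithmetic, with illustrative variable names:

```python
# Probabilities given in Question 9.
p_h = 0.3          # husband smokes
p_w = 0.2          # wife smokes
p_h_given_w = 0.5  # husband smokes, given that the wife smokes

p_both = p_h_given_w * p_w        # multiplication rule
p_either = p_h + p_w - p_both     # addition rule
p_neither = 1 - p_either          # complement rule

# Mutually exclusive events would have P(H and W) = 0.
mutually_exclusive = p_both == 0

print(round(p_both, 2), round(p_neither, 2), mutually_exclusive)  # 0.1 0.6 False
```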


10. A fair coin has probability 0.5 of coming up heads and 0.5 of coming up tails. Students in two sections of STA220 perform random experiments with fair coins. Section 1 has 250 students and section 2 has 200 students. In section 1, each of the students tosses a fair coin 10 times and records his or her value of p̂, the proportion of heads. In section 2, each of the students tosses a fair coin 40 times and records his or her value of p̂, the proportion of heads.

(a) (2 marks) In which section would you expect to have more values of p̂ that are greater than 0.7? Why?

Section 1. The sampling distribution of p̂ in section 1 has more variability since each student only tosses the coin 10 times (n = 10), compared to the 40 tosses by each student in section 2. So in section 1, there is a higher chance of getting values of p̂ that are further from the mean.
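The answer to part (a) can be backed up with exact binomial probabilities. The sketch below sums the binomial probability mass over all head counts whose proportion exceeds 0.7; the function name `prob_phat_exceeds` is illustrative, and the expected counts simply multiply each probability by the section sizes given in the question.

```python
from math import comb


def prob_phat_exceeds(n, threshold, p=0.5):
    """Exact P(p-hat > threshold) for n independent tosses with P(heads) = p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k / n > threshold
    )


p_section1 = prob_phat_exceeds(10, 0.7)  # needs 8, 9 or 10 heads
p_section2 = prob_phat_exceeds(40, 0.7)  # needs 29 or more heads

# Expected number of students in each section with p-hat above 0.7.
expected1 = 250 * p_section1
expected2 = 200 * p_section2
print(p_section1, p_section2)
print(expected1, expected2)
```

With n = 10 the probability is 56/1024, a little over 5%, and the probability for n = 40 is far smaller, so section 1 should contribute noticeably more values above 0.7.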

(b) (2 marks) Below are two histograms.
Histogram A shows the 250 values of p̂ obtained by the students in section 1.
Histogram C shows the 450 values of p̂ obtained by the students in both sections.

[Histogram A: frequency (vertical axis, ticks at 0 to 60) against proportion of heads (horizontal axis, ticks at 0.2 to 0.8).]

[Histogram C: frequency (vertical axis, ticks at 0 to 150) against proportion of heads (horizontal axis, ticks at 0.2 to 0.8).]

Which histogram gives the best representation of a sampling distribution of p̂? Why?

Histogram A because all of the observations in histogram A are based on the same sample size (n = 10). Histogram C is not a representation of a sampling distribution since it involves samples of both size 10 and 40 which have different variability.


11. A botanist is interested in the probability that a cross between two pink flowering plants will produce a red flowering plant. She makes 100 crosses and 31 of them produce a red flowering plant.

(a) (2 marks) Identify the parameter of interest and the relevant statistic (to estimate the parameter) in this problem.

The parameter: p = the probability a cross produces a red flowering plant
The statistic: p̂ = 31/100 = the estimated proportion of crosses that produce a red flowering plant

(b) (3 marks) Find a 90% confidence interval for the probability that a cross between two pink flowering plants results in a red flowering plant.

For a 90% confidence interval, $z_{\alpha/2} = z_{0.05} = 1.645$

90% confidence interval for p:

$$\frac{31}{100} \pm 1.645 \sqrt{\frac{31/100\,(1 - 31/100)}{100}} = (0.234, 0.386)$$
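The interval in part (b) can be reproduced in a few lines of Python. This sketch follows the same normal-approximation formula; the variable names are illustrative, and the critical value comes from `statistics.NormalDist` instead of the printed table.

```python
import math
from statistics import NormalDist

successes, n, level = 31, 100, 0.90

p_hat = successes / n
# Critical value z_{alpha/2}; for a 90% interval this is about 1.645.
z = NormalDist().inv_cdf(1 - (1 - level) / 2)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - margin, p_hat + margin
print(round(z, 3), round(lower, 3), round(upper, 3))  # 1.645 0.234 0.386
```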

(c) (2 marks) Verify that the conditions necessary for the construction of the confidence interval in (b) are satisfied.

We need the 100 crosses to be independent. We’ll assume that this is true because (hopefully) the botanist used independent plants.

We need n to be large enough for the Central Limit Theorem to apply. To check this: $n\hat{p} = 100\left(\frac{31}{100}\right) = 31 \ge 10$ and $n(1 - \hat{p}) = 100\left(1 - \frac{31}{100}\right) = 69 \ge 10$


(d) (1 mark) A genetic theory states that a cross between two pink flowering plants will produce red flowering plants 25% of the time. Does the theory seem plausible? Why or why not?

Yes. 0.25 is in the confidence interval found in part (b).

(e) (4 marks) Below are four possible interpretations of the confidence interval you found in part (b). Circle all of the valid interpretations.

• The probability is 0.90 that the interval you found in part (b) includes the true probability that a cross will produce a red flowering plant.

• We are 90% confident that the botanist’s sample estimate of the probability of producing a red flowering plant is between the upper and lower limit that was found in part (b).

• If the botanist repeated her procedure many times, calculating the confidence interval each time, 90% of the time she would get an interval that includes the probability that a cross will produce a red flowering plant. (This is the only valid interpretation.)

• If the botanist repeated her procedure many times, calculating the confidence interval each time, 90% of the time she would get an interval that includes 0.31.
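The repeated-sampling interpretation can be explored with a small simulation. The sketch below is only illustrative: it assumes a true probability of 0.25 (the value from the genetic theory in part (d), not something the data establish), and the function names `wald_interval` and `coverage`, the number of repetitions, and the seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def wald_interval(successes, n, level=0.90):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def coverage(true_p, n=100, repetitions=2000, seed=0):
    """Fraction of simulated experiments whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One simulated experiment: n crosses, each red with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        lower, upper = wald_interval(successes, n)
        hits += lower <= true_p <= upper
    return hits / repetitions


estimated_coverage = coverage(true_p=0.25)
print(estimated_coverage)
```

The printed fraction should land close to 0.90. It will not match exactly, both because of simulation noise and because the interval relies on a normal approximation.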
