Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 ›...

187
Chapter 1: Introduction to data OpenIntro Statistics, 2nd Edition

Transcript of Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 ›...

Page 1: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Chapter 1: Introduction to data

OpenIntro Statistics, 2nd Edition

Page 2: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 3: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Treating Chronic Fatigue Syndrome

Objective: Evaluate the effectiveness of cognitive-behaviortherapy for chronic fatigue syndrome.

Participant pool: 142 patients who were recruited from referralsby primary care physicians and consultants to a hospital clinicspecializing in chronic fatigue syndrome.

Actual participants: Only 60 of the 142 referred patients enteredthe study. Some were excluded because they didn’t meet thediagnostic criteria, some had other health issues, and somerefused to be a part of the study.

Deale et. al. Cognitive behavior therapy for chronic fatigue syndrome: A randomized controlled trial. The American Journal of

Psychiatry 154.3 (1997).

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 2 / 94

Page 4: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Study design

Patients randomly assigned to treatment and control groups, 30patients in each group:

Treatment: Cognitive behavior therapy – collaborative, educative,and with a behavioral emphasis. Patients were shown on howactivity could be increased steadily and safely withoutexacerbating symptoms.Control: Relaxation – No advice was given about how activitycould be increased. Instead progressive muscle relaxation,visualization, and rapid relaxation skills were taught.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 3 / 94

Page 5: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Results

The table below shows the distribution of patients with goodoutcomes at 6-month follow-up. Note that 7 patients dropped out ofthe study: 3 from the treatment and 4 from the control group.

Good outcomeYes No Total

Treatment 19 8 27GroupControl 5 21 26Total 24 29 53

Proportion with good outcomes in treatment group:

19/27 ≈ 0.70→ 70%

Proportion with good outcomes in control group:

5/26 ≈ 0.19→ 19%

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 4 / 94

Page 6: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Results

The table below shows the distribution of patients with goodoutcomes at 6-month follow-up. Note that 7 patients dropped out ofthe study: 3 from the treatment and 4 from the control group.

Good outcomeYes No Total

Treatment 19 8 27GroupControl 5 21 26Total 24 29 53

Proportion with good outcomes in treatment group:

19/27 ≈ 0.70→ 70%

Proportion with good outcomes in control group:

5/26 ≈ 0.19→ 19%

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 4 / 94

Page 7: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Results

The table below shows the distribution of patients with goodoutcomes at 6-month follow-up. Note that 7 patients dropped out ofthe study: 3 from the treatment and 4 from the control group.

Good outcomeYes No Total

Treatment 19 8 27GroupControl 5 21 26Total 24 29 53

Proportion with good outcomes in treatment group:

19/27 ≈ 0.70→ 70%

Proportion with good outcomes in control group:

5/26 ≈ 0.19→ 19%

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 4 / 94

Page 8: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Understanding the results

Do the data show a “real” difference between the groups?

Suppose you flip a coin 100 times. While the chance a coin landsheads in any given coin flip is 50%, we probably won’t observeexactly 50 heads. This type of fluctuation is part of almost anytype of data generating process.

The observed difference between the two groups (70 - 19 =51%) may be real, or may be due to natural variation.

Since the difference is quite large, it is more believable that thedifference is real.

We need statistical tools to determine if the difference is so largethat we should reject the notion that it was due to chance.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 5 / 94

Page 9: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Understanding the results

Do the data show a “real” difference between the groups?

Suppose you flip a coin 100 times. While the chance a coin landsheads in any given coin flip is 50%, we probably won’t observeexactly 50 heads. This type of fluctuation is part of almost anytype of data generating process.

The observed difference between the two groups (70 - 19 =51%) may be real, or may be due to natural variation.

Since the difference is quite large, it is more believable that thedifference is real.

We need statistical tools to determine if the difference is so largethat we should reject the notion that it was due to chance.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 5 / 94

Page 10: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Generalizing the results

Are the results of this study generalizable to all patients with chronicfatigue syndrome?

These patients had specific characteristics and volunteered to be apart of this study, therefore they may not be representative of allpatients with chronic fatigue syndrome. While we cannot immediatelygeneralize the results to all patients, this first study is encouraging.The method works for patients with some narrow set ofcharacteristics, and that gives hope that it will work, at least to somedegree, with other patients.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 6 / 94

Page 11: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study

Generalizing the results

Are the results of this study generalizable to all patients with chronicfatigue syndrome?

These patients had specific characteristics and volunteered to be apart of this study, therefore they may not be representative of allpatients with chronic fatigue syndrome. While we cannot immediatelygeneralize the results to all patients, this first study is encouraging.The method works for patients with some narrow set ofcharacteristics, and that gives hope that it will work, at least to somedegree, with other patients.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 6 / 94

Page 12: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics

1 Case study

2 Data basicsObservations and variablesTypes of variablesRelationships among variablesAssociated and independent variables

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 13: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Observations and variables

Data matrix

Data collected on students in a statistics class on a variety ofvariables:

variable↓

Stu. gender intro extra · · · dread

1 male extravert · · · 32 female extravert · · · 23 female introvert · · · 4 ←

4 female extravert · · · 2 observation...

......

......

86 male extravert · · · 3

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 7 / 94

Page 14: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables

all variables

numerical categorical

continuous discrete regularcategorical ordinal

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 8 / 94

Page 15: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender:

sleep:

bedtime:

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 16: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep:

bedtime:

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 17: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep:

bedtime:

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 18: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime:

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 19: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime:

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 20: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime: categorical, ordinal

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 21: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime: categorical, ordinal

countries:

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 22: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime: categorical, ordinal

countries: numerical, discrete

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 23: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime: categorical, ordinal

countries: numerical, discrete

dread:

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 24: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Types of variables (cont.)

gender sleep bedtime countries dread

1 male 5 12-2 13 32 female 7 10-12 7 23 female 5.5 12-2 1 44 female 7 12-2 25 female 3 12-2 1 36 female 3 12-2 9 4

gender: categorical

sleep: numerical, continuous

bedtime: categorical, ordinal

countries: numerical, discrete

dread: categorical, ordinal - could also be used as numerical

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 9 / 94

Page 25: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Practice

What type of variable is a telephone area code?

(a) numerical, continuous

(b) numerical, discrete

(c) categorical

(d) categorical, ordinal

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 10 / 94

Page 26: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Types of variables

Practice

What type of variable is a telephone area code?

(a) numerical, continuous

(b) numerical, discrete

(c) categorical

(d) categorical, ordinal

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 10 / 94

Page 27: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Relationships among variables

Relationships among variables

Does there appear to be a relationship between number of alcoholicdrinks consumed per week and age at first alcohol consumption?

0 10 20 30 40 50 60 70

3.0

3.5

4.0

Hours of study / week

GPA

Can you spot anything unusual about any of the data points?

There is one student with GPA > 4.0, this is likely a data error.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 11 / 94

Page 28: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Relationships among variables

Relationships among variables

Does there appear to be a relationship between number of alcoholicdrinks consumed per week and age at first alcohol consumption?

0 10 20 30 40 50 60 70

3.0

3.5

4.0

Hours of study / week

GPA

Can you spot anything unusual about any of the data points?

There is one student with GPA > 4.0, this is likely a data error.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 11 / 94

Page 29: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Relationships among variables

Relationships among variables

Does there appear to be a relationship between number of alcoholicdrinks consumed per week and age at first alcohol consumption?

0 10 20 30 40 50 60 70

3.0

3.5

4.0

Hours of study / week

GPA

Can you spot anything unusual about any of the data points?

There is one student with GPA > 4.0, this is likely a data error.OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 11 / 94

Page 30: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Associated and independent variables

Practice

Based on the scatterplot on theright, which of the following state-ments is correct about the headand skull lengths of possums?

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●●

● ●

●●

●●

●●●

85 90 95 100

50

55

60

65

head length (mm)

skul

l wid

th (

mm

)

(a) There is no relationship between head length and skull width, i.e.the variables are independent.

(b) Head length and skull width are positively associated.

(c) Skull width and head length are negatively associated.

(d) A longer head causes the skull to be wider.

(e) A wider skull causes the head to be longer.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 12 / 94

Page 31: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Associated and independent variables

Practice

Based on the scatterplot on theright, which of the following state-ments is correct about the headand skull lengths of possums?

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●●

● ●

●●

●●

●●●

85 90 95 100

50

55

60

65

head length (mm)

skul

l wid

th (

mm

)

(a) There is no relationship between head length and skull width, i.e.the variables are independent.

(b) Head length and skull width are positively associated.

(c) Skull width and head length are negatively associated.

(d) A longer head causes the skull to be wider.

(e) A wider skull causes the head to be longer.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 12 / 94

Page 32: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Data basics Associated and independent variables

Associated vs. independent

When two variables show some connection with one another,they are called associated variables.

Associated variables can also be called dependent variables andvice-versa.

If two variables are not associated, i.e. there is no evidentconnection between the two, then they are said to beindependent.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 13 / 94

Page 33: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles

1 Case study

2 Data basics

3 Overview of data collection principlesPopulations and samplesAnecdotal evidenceSampling from a populationExplanatory and response variablesObservational studies and experiments

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 34: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?

Population of interest: All people

Sample: Group of adult women who recently joined a running groupPopulation to which results can be generalized: Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 35: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?Population of interest:

All people

Sample: Group of adult women who recently joined a running groupPopulation to which results can be generalized: Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 36: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?Population of interest: All people

Sample: Group of adult women who recently joined a running groupPopulation to which results can be generalized: Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 37: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?Population of interest: All people

Sample: Group of adult women who recently joined a running group

Population to which results can be generalized: Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 38: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?Population of interest: All people

Sample: Group of adult women who recently joined a running groupPopulation to which results can be generalized:

Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 39: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Populations and samples

Populations and samples

http:// well.blogs.nytimes.com/ 2012/ 08/ 29/

finding-your-ideal-running-form

Research question: Can peoplebecome better, more efficientrunners on their own, merely byrunning?Population of interest: All people

Sample: Group of adult women who recently joined a running groupPopulation to which results can be generalized: Adult women, if thedata are randomly sampled

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 14 / 94

Page 40: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Anecdotal evidence

Anecdotal evidence and early smoking research

Anti-smoking research started in the 1930s and 1940s whencigarette smoking became increasingly popular. While somesmokers seemed to be sensitive to cigarette smoke, others werecompletely unaffected.

Anti-smoking research was faced with resistance based onanecdotal evidence such as “My uncle smokes three packs a dayand he’s in perfectly good health”, evidence based on a limitedsample size that might not be representative of the population.

It was concluded that “smoking is a complex human behavior, byits nature difficult to study, confounded by human variability.”

In time researchers were able to examine larger samples ofcases (smokers), and trends showing that smoking has negativehealth impacts became much clearer.

Brandt, The Cigarette Century (2009), Basic Books.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 15 / 94

Page 41: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Census

Wouldn’t it be better to just include everyone and “sample” theentire population?

This is called a census.

There are problems with taking a census:

It can be difficult to complete a census: there always seem to besome individuals who are hard to locate or hard to measure. Andthese difficult-to-find people may have certain characteristics thatdistinguish them from the rest of the population.Populations rarely stand still. Even if you could take a census, thepopulation changes constantly, so it’s never possible to get aperfect measure.Taking a census may be more complex than sampling.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 16 / 94

Page 42: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Census

Wouldn’t it be better to just include everyone and “sample” theentire population?

This is called a census.

There are problems with taking a census:

It can be difficult to complete a census: there always seem to besome individuals who are hard to locate or hard to measure. Andthese difficult-to-find people may have certain characteristics thatdistinguish them from the rest of the population.Populations rarely stand still. Even if you could take a census, thepopulation changes constantly, so it’s never possible to get aperfect measure.Taking a census may be more complex than sampling.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 16 / 94

Page 43: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

http:// www.npr.org/ templates/ story/ story.php?storyId=125380052

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 17 / 94

Page 44: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Exploratory analysis to inference

Sampling is natural.

Think about sampling something you are cooking - you taste(examine) a small part of what you’re cooking to get an ideaabout the dish as a whole.When you taste a spoonful of soup and decide the spoonful youtasted isn’t salty enough, that’s exploratory analysis.If you generalize and conclude that your entire soup needs salt,that’s an inference.For your inference to be valid, the spoonful you tasted (thesample) needs to be representative of the entire pot (thepopulation).

If your spoonful comes only from the surface and the salt iscollected at the bottom of the pot, what you tasted is probably notrepresentative of the whole pot.If you first stir the soup thoroughly before you taste, your spoonfulwill more likely be representative of the whole pot.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 18 / 94

Page 45: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Exploratory analysis to inference

Sampling is natural.Think about sampling something you are cooking - you taste(examine) a small part of what you’re cooking to get an ideaabout the dish as a whole.

When you taste a spoonful of soup and decide the spoonful youtasted isn’t salty enough, that’s exploratory analysis.If you generalize and conclude that your entire soup needs salt,that’s an inference.For your inference to be valid, the spoonful you tasted (thesample) needs to be representative of the entire pot (thepopulation).

If your spoonful comes only from the surface and the salt iscollected at the bottom of the pot, what you tasted is probably notrepresentative of the whole pot.If you first stir the soup thoroughly before you taste, your spoonfulwill more likely be representative of the whole pot.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 18 / 94

Page 46: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Exploratory analysis to inference

Sampling is natural.Think about sampling something you are cooking - you taste(examine) a small part of what you’re cooking to get an ideaabout the dish as a whole.When you taste a spoonful of soup and decide the spoonful youtasted isn’t salty enough, that’s exploratory analysis.

If you generalize and conclude that your entire soup needs salt,that’s an inference.For your inference to be valid, the spoonful you tasted (thesample) needs to be representative of the entire pot (thepopulation).

If your spoonful comes only from the surface and the salt iscollected at the bottom of the pot, what you tasted is probably notrepresentative of the whole pot.If you first stir the soup thoroughly before you taste, your spoonfulwill more likely be representative of the whole pot.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 18 / 94

Page 47: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Exploratory analysis to inference

Sampling is natural.Think about sampling something you are cooking - you taste(examine) a small part of what you’re cooking to get an ideaabout the dish as a whole.When you taste a spoonful of soup and decide the spoonful youtasted isn’t salty enough, that’s exploratory analysis.If you generalize and conclude that your entire soup needs salt,that’s an inference.

For your inference to be valid, the spoonful you tasted (thesample) needs to be representative of the entire pot (thepopulation).

If your spoonful comes only from the surface and the salt iscollected at the bottom of the pot, what you tasted is probably notrepresentative of the whole pot.If you first stir the soup thoroughly before you taste, your spoonfulwill more likely be representative of the whole pot.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 18 / 94

Page 48: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Exploratory analysis to inference

Sampling is natural.Think about sampling something you are cooking - you taste(examine) a small part of what you’re cooking to get an ideaabout the dish as a whole.When you taste a spoonful of soup and decide the spoonful youtasted isn’t salty enough, that’s exploratory analysis.If you generalize and conclude that your entire soup needs salt,that’s an inference.For your inference to be valid, the spoonful you tasted (thesample) needs to be representative of the entire pot (thepopulation).

If your spoonful comes only from the surface and the salt iscollected at the bottom of the pot, what you tasted is probably notrepresentative of the whole pot.If you first stir the soup thoroughly before you taste, your spoonfulwill more likely be representative of the whole pot.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 18 / 94

Page 49: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias

Non-response: If only a small fraction of the randomly sampledpeople choose to respond to a survey, the sample may no longerbe representative of the population.

Voluntary response: Occurs when the sample consists of peoplewho volunteer to respond because they have strong opinions onthe issue. Such a sample will also not be representative of thepopulation.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible aremore likely to be included in the sample.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 19 / 94

Page 50: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias

Non-response: If only a small fraction of the randomly sampledpeople choose to respond to a survey, the sample may no longerbe representative of the population.Voluntary response: Occurs when the sample consists of peoplewho volunteer to respond because they have strong opinions onthe issue. Such a sample will also not be representative of thepopulation.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible aremore likely to be included in the sample.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 19 / 94

Page 51: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias

Non-response: If only a small fraction of the randomly sampledpeople choose to respond to a survey, the sample may no longerbe representative of the population.Voluntary response: Occurs when the sample consists of peoplewho volunteer to respond because they have strong opinions onthe issue. Such a sample will also not be representative of thepopulation.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible aremore likely to be included in the sample.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 19 / 94

Page 52: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias

Non-response: If only a small fraction of the randomly sampledpeople choose to respond to a survey, the sample may no longerbe representative of the population.Voluntary response: Occurs when the sample consists of peoplewho volunteer to respond because they have strong opinions onthe issue. Such a sample will also not be representative of thepopulation.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible aremore likely to be included in the sample.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 19 / 94

Page 53: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias

Non-response: If only a small fraction of the randomly sampledpeople choose to respond to a survey, the sample may no longerbe representative of the population.Voluntary response: Occurs when the sample consists of peoplewho volunteer to respond because they have strong opinions onthe issue. Such a sample will also not be representative of thepopulation.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible aremore likely to be included in the sample.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 19 / 94

Page 54: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Sampling bias example: Landon vs. FDR

A historical example of a biased sample yielding misleading results:

In 1936, Landonsought theRepublicanpresidentialnomination opposingthe re-election ofFDR.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 20 / 94

Page 55: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

The Literary Digest Poll

The Literary Digest polled about 10 millionAmericans, and got responses from about2.4 million.

The poll showed that Landon would likelybe the overwhelming winner and FDRwould get only 43% of the votes.

Election result: FDR won, with 62% of thevotes.

The magazine was completely discredited because of the poll,and was soon discontinued.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 21 / 94

Page 56: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

The Literary Digest Poll – what went wrong?

The magazine had surveyed

its own readers,registered automobile owners, andregistered telephone users.

These groups had incomes well above the national average ofthe day (remember, this is Great Depression era) which resultedin lists of voters far more likely to support Republicans than atruly typical voter of the time, i.e. the sample was notrepresentative of the American population at the time.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 22 / 94

Page 57: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Large samples are preferable, but...

The Literary Digest election poll was based on a sample size of2.4 million, which is huge, but since the sample was biased, thesample did not yield an accurate prediction.

Back to the soup analogy: If the soup is not well stirred, it doesn’tmatter how large a spoon you have, it will still not taste right. Ifthe soup is well stirred, a small spoon will suffice to test the soup.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 23 / 94

Page 58: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Practice

A school district is considering whether it will no longer allow high schoolstudents to park at school after two recent accidents where students wereseverely injured. As a first step, they survey parents by mail, asking themwhether or not the parents would object to this policy change. Of 6,000 sur-veys that go out, 1,200 are returned. Of these 1,200 surveys that were com-pleted, 960 agreed with the policy change and 240 disagreed. Which of thefollowing statements are true?

I. Some of the mailings may have never reached the parents.

II. The school district has strong support from parents to move forwardwith the policy approval.

III. It is possible that majority of the parents of high school studentsdisagree with the policy change.

IV. The survey results are unlikely to be biased because all parents weremailed a survey.

(a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 24 / 94

Page 59: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Sampling from a population

Practice

A school district is considering whether it will no longer allow high schoolstudents to park at school after two recent accidents where students wereseverely injured. As a first step, they survey parents by mail, asking themwhether or not the parents would object to this policy change. Of 6,000 sur-veys that go out, 1,200 are returned. Of these 1,200 surveys that were com-pleted, 960 agreed with the policy change and 240 disagreed. Which of thefollowing statements are true?

I. Some of the mailings may have never reached the parents.

II. The school district has strong support from parents to move forwardwith the policy approval.

III. It is possible that majority of the parents of high school studentsdisagree with the policy change.

IV. The survey results are unlikely to be biased because all parents weremailed a survey.

(a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 24 / 94

Page 60: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Explanatory and response variables

Explanatory and response variables

To identify the explanatory variable in a pair of variables, identifywhich of the two is suspected of affecting the other:

explanatory variablemight affect−−−−−−−−→response variable

Labeling variables as explanatory and response does notguarantee the relationship between the two is actually causal,even if there is an association identified between the twovariables. We use these labels only to keep track of whichvariable we suspect affects the other.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 25 / 94

Page 61: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Observational studies and experiments

Observational studies and experiments

Observational study: Researchers collect data in a way that doesnot directly interfere with how the data arise, i.e. they merely“observe”, and can only establish an association between theexplanatory and response variables.

Experiment: Researchers randomly assign subjects to varioustreatments in order to establish causal connections between theexplanatory and response variables.If you’re going to walk away with one thing from this class, let itbe “correlation does not imply causation”.

http:// xkcd.com/ 552/

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 26 / 94

Page 62: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Observational studies and experiments

Observational studies and experiments

Observational study: Researchers collect data in a way that doesnot directly interfere with how the data arise, i.e. they merely“observe”, and can only establish an association between theexplanatory and response variables.Experiment: Researchers randomly assign subjects to varioustreatments in order to establish causal connections between theexplanatory and response variables.

If you’re going to walk away with one thing from this class, let itbe “correlation does not imply causation”.

http:// xkcd.com/ 552/

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 26 / 94

Page 63: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Overview of data collection principles Observational studies and experiments

Observational studies and experiments

Observational study: Researchers collect data in a way that doesnot directly interfere with how the data arise, i.e. they merely“observe”, and can only establish an association between theexplanatory and response variables.Experiment: Researchers randomly assign subjects to varioustreatments in order to establish causal connections between theexplanatory and response variables.If you’re going to walk away with one thing from this class, let itbe “correlation does not imply causation”.

http:// xkcd.com/ 552/

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 26 / 94

Page 64: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategiesConfoundingSampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 65: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

http:// www.peertrainer.com/ LoungeCommunityThread.aspx?ForumID=1&ThreadID=3118

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 27 / 94

Page 66: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

What type of study is this, observational study or an experiment? “Girls

who regularly ate breakfast, particularly one that includes cereal, were slimmer than

those who skipped the morning meal, according to a study that tracked nearly 2,400

girls for 10 years. [...] As part of the survey, the girls were asked once a year what

they had eaten during the previous three days.”

This is an observational study since the researchers merely observedthe behavior of the girls (subjects) as opposed to imposing treatmentson them.

What is the conclusion of the study?

There is an association between girls eating breakfast and beingslimmer.

Who sponsored the study?

General Mills.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 28 / 94

Page 67: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

What type of study is this, observational study or an experiment? “Girls

who regularly ate breakfast, particularly one that includes cereal, were slimmer than

those who skipped the morning meal, according to a study that tracked nearly 2,400

girls for 10 years. [...] As part of the survey, the girls were asked once a year what

they had eaten during the previous three days.”

This is an observational study since the researchers merely observedthe behavior of the girls (subjects) as opposed to imposing treatmentson them.What is the conclusion of the study?

There is an association between girls eating breakfast and beingslimmer.

Who sponsored the study?

General Mills.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 28 / 94

Page 68: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

What type of study is this, observational study or an experiment? “Girls

who regularly ate breakfast, particularly one that includes cereal, were slimmer than

those who skipped the morning meal, according to a study that tracked nearly 2,400

girls for 10 years. [...] As part of the survey, the girls were asked once a year what

they had eaten during the previous three days.”

This is an observational study since the researchers merely observedthe behavior of the girls (subjects) as opposed to imposing treatmentson them.What is the conclusion of the study?

There is an association between girls eating breakfast and beingslimmer.Who sponsored the study?

General Mills.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 28 / 94

Page 69: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

What type of study is this, observational study or an experiment? “Girls

who regularly ate breakfast, particularly one that includes cereal, were slimmer than

those who skipped the morning meal, according to a study that tracked nearly 2,400

girls for 10 years. [...] As part of the survey, the girls were asked once a year what

they had eaten during the previous three days.”

This is an observational study since the researchers merely observedthe behavior of the girls (subjects) as opposed to imposing treatmentson them.What is the conclusion of the study?

There is an association between girls eating breakfast and beingslimmer.Who sponsored the study?

General Mills.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 28 / 94

Page 70: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

3 possible explanations

1. Eating breakfast causes girls to be thinner.

2. Being thin causes girls to eat breakfast.

3. A third variable is responsible for both. What could it be?An extraneous variable that affects both the explanatory and theresponse variable and that make it seem like there is arelationship between the two are called confounding variables.

Images from: http:// www.appforhealth.com/ wp-content/ uploads/ 2011/ 08/ ipn-cerealfrijo-300x135.jpg,

http:// www.dreamstime.com/ stock-photography-too-thin-woman-anorexia-model- image2814892.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 29 / 94

Page 71: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

3 possible explanations

1. Eating breakfast causes girls to be thinner.

2. Being thin causes girls to eat breakfast.

3. A third variable is responsible for both. What could it be?An extraneous variable that affects both the explanatory and theresponse variable and that make it seem like there is arelationship between the two are called confounding variables.

Images from: http:// www.appforhealth.com/ wp-content/ uploads/ 2011/ 08/ ipn-cerealfrijo-300x135.jpg,

http:// www.dreamstime.com/ stock-photography-too-thin-woman-anorexia-model- image2814892.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 29 / 94

Page 72: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

3 possible explanations

1. Eating breakfast causes girls to be thinner.

2. Being thin causes girls to eat breakfast.

3. A third variable is responsible for both. What could it be?An extraneous variable that affects both the explanatory and theresponse variable and that make it seem like there is arelationship between the two are called confounding variables.

Images from: http:// www.appforhealth.com/ wp-content/ uploads/ 2011/ 08/ ipn-cerealfrijo-300x135.jpg,

http:// www.dreamstime.com/ stock-photography-too-thin-woman-anorexia-model- image2814892.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 29 / 94

Page 73: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

3 possible explanations

1. Eating breakfast causes girls to be thinner.

2. Being thin causes girls to eat breakfast.

3. A third variable is responsible for both. What could it be?An extraneous variable that affects both the explanatory and theresponse variable and that make it seem like there is arelationship between the two are called confounding variables.

Images from: http:// www.appforhealth.com/ wp-content/ uploads/ 2011/ 08/ ipn-cerealfrijo-300x135.jpg,

http:// www.dreamstime.com/ stock-photography-too-thin-woman-anorexia-model- image2814892.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 29 / 94

Page 74: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Confounding

Prospective vs. retrospective studies

A prospective study identifies individuals and collects informationas events unfold.

Example: The Nurses Health Study has been recruitingregistered nurses and then collecting data from them usingquestionnaires since 1976.

Retrospective studies collect data after events have taken place.Example: Researchers reviewing past events in medical records.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 30 / 94

Page 75: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Obtaining good samples

Almost all statistical methods are based on the notion of impliedrandomness.

If observational data are not collected in a random frameworkfrom a population, these statistical methods – the estimates anderrors associated with the estimates – are not reliable.

Most commonly used random sampling techniques are simple,stratified, and cluster sampling.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 31 / 94

Page 76: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Simple random sample

Randomly select cases from the population, where there is no impliedconnection between the points that are selected.

Index

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

Index

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

Stratum 1

Stratum 2

Stratum 3

Stratum 4

Stratum 5

Stratum 6

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Cluster 8

Cluster 9

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 32 / 94

Page 77: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Stratified sample

Strata are made up of similar observations. We take a simple randomsample from each stratum.

Index

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

Index

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

Stratum 1

Stratum 2

Stratum 3

Stratum 4

Stratum 5

Stratum 6

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Cluster 8

Cluster 9

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 33 / 94

Page 78: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Cluster sample

Clusters are usually not made up of homogeneous observations, andwe take a simple random sample from a random sample of clusters.Usually preferred for economical reasons.

Index

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

Index

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

Stratum 1

Stratum 2

Stratum 3

Stratum 4

Stratum 5

Stratum 6

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Cluster 8

Cluster 9

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 34 / 94

Page 79: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Practice

A city council has requested a household survey be conducted in asuburban area of their city. The area is broken into many distinct andunique neighborhoods, some including large homes, some with onlyapartments. Which approach would likely be the least effective?

(a) Simple random sampling

(b) Cluster sampling

(c) Stratified sampling

(d) Blocked sampling

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 35 / 94

Page 80: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Observational studies and sampling strategies Sampling strategies

Practice

A city council has requested a household survey be conducted in asuburban area of their city. The area is broken into many distinct andunique neighborhoods, some including large homes, some with onlyapartments. Which approach would likely be the least effective?

(a) Simple random sampling

(b) Cluster sampling

(c) Stratified sampling

(d) Blocked sampling

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 35 / 94

Page 81: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 82: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Principles of experimental design

1. Control: Compare treatment of interest to a control group.

2. Randomize: Randomly assign subjects to treatments, andrandomly sample from the population whenever possible.

3. Replicate: Within a study, replicate by collecting a sufficientlylarge sample. Or replicate the entire study.

4. Block: If there are variables that are known or suspected to affectthe response variable, first group subjects into blocks based onthese variables, and then randomize cases within each block totreatment groups.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 36 / 94

Page 83: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More on blocking

We would like to design an experiment toinvestigate if energy gels makes you run faster:

Treatment: energy gelControl: no energy gel

It is suspected that energy gels might affect proand amateur athletes differently, therefore weblock for pro status:

Divide the sample to pro and amateurRandomly assign pro athletes to treatment andcontrol groupsRandomly assign amateur athletes totreatment and control groupsPro/amateur status is equally represented inthe resulting treatment and control groups

Why is this important? Can you think of other variables to block for?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 37 / 94

Page 84: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More on blocking

We would like to design an experiment toinvestigate if energy gels makes you run faster:

Treatment: energy gelControl: no energy gel

It is suspected that energy gels might affect proand amateur athletes differently, therefore weblock for pro status:

Divide the sample to pro and amateurRandomly assign pro athletes to treatment andcontrol groupsRandomly assign amateur athletes totreatment and control groupsPro/amateur status is equally represented inthe resulting treatment and control groups

Why is this important? Can you think of other variables to block for?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 37 / 94

Page 85: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More on blocking

We would like to design an experiment toinvestigate if energy gels makes you run faster:

Treatment: energy gelControl: no energy gel

It is suspected that energy gels might affect proand amateur athletes differently, therefore weblock for pro status:

Divide the sample to pro and amateurRandomly assign pro athletes to treatment andcontrol groupsRandomly assign amateur athletes totreatment and control groupsPro/amateur status is equally represented inthe resulting treatment and control groups

Why is this important? Can you think of other variables to block for?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 37 / 94

Page 86: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More on blocking

We would like to design an experiment toinvestigate if energy gels makes you run faster:

Treatment: energy gelControl: no energy gel

It is suspected that energy gels might affect proand amateur athletes differently, therefore weblock for pro status:

Divide the sample to pro and amateurRandomly assign pro athletes to treatment andcontrol groupsRandomly assign amateur athletes totreatment and control groupsPro/amateur status is equally represented inthe resulting treatment and control groups

Why is this important? Can you think of other variables to block for?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 37 / 94

Page 87: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More on blocking

We would like to design an experiment toinvestigate if energy gels makes you run faster:

Treatment: energy gelControl: no energy gel

It is suspected that energy gels might affect proand amateur athletes differently, therefore weblock for pro status:

Divide the sample to pro and amateurRandomly assign pro athletes to treatment andcontrol groupsRandomly assign amateur athletes totreatment and control groupsPro/amateur status is equally represented inthe resulting treatment and control groups

Why is this important? Can you think of other variables to block for?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 37 / 94

Page 88: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Practice

A study is designed to test the effect of light level and noise level onexam performance of students. The researcher also believes that lightand noise levels might have different effects on males and females,so wants to make sure both genders are equally represented in eachgroup. Which of the below is correct?

(a) There are 3 explanatory variables (light, noise, gender) and 1response variable (exam performance)

(b) There are 2 explanatory variables (light and noise), 1 blockingvariable (gender), and 1 response variable (exam performance)

(c) There is 1 explanatory variable (gender) and 3 response variables(light, noise, exam performance)

(d) There are 2 blocking variables (light and noise), 1 explanatoryvariable (gender), and 1 response variable (exam performance)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 38 / 94

Page 89: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Practice

A study is designed to test the effect of light level and noise level onexam performance of students. The researcher also believes that lightand noise levels might have different effects on males and females,so wants to make sure both genders are equally represented in eachgroup. Which of the below is correct?

(a) There are 3 explanatory variables (light, noise, gender) and 1response variable (exam performance)

(b) There are 2 explanatory variables (light and noise), 1 blockingvariable (gender), and 1 response variable (exam performance)

(c) There is 1 explanatory variable (gender) and 3 response variables(light, noise, exam performance)

(d) There are 2 blocking variables (light and noise), 1 explanatoryvariable (gender), and 1 response variable (exam performance)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 38 / 94

Page 90: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Difference between blocking and explanatory variables

Factors are conditions we can impose on the experimental units.

Blocking variables are characteristics that the experimental unitscome with, that we would like to control for.

Blocking is like stratifying, except used in experimental settingswhen randomly assigning, as opposed to when sampling.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 39 / 94

Page 91: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

More experimental design terminology...

Placebo: fake treatment, often used as the control group formedical studies

Placebo effect: experimental units showing improvement simplybecause they believe they are receiving a special treatment

Blinding: when experimental units do not know whether they arein the control or treatment group

Double-blind: when both the experimental units and theresearchers who interact with the patients do not know who is inthe control and who is in the treatment group

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 40 / 94

Page 92: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Practice

What is the main difference between observational studies and exper-iments?

(a) Experiments take place in a lab while observational studies donot need to.

(b) In an observational study we only look at what happened in thepast.

(c) Most experiments use random assignment while observationalstudies do not.

(d) Observational studies are completely useless since no causalinference can be made based on their findings.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 41 / 94

Page 93: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Practice

What is the main difference between observational studies and exper-iments?

(a) Experiments take place in a lab while observational studies donot need to.

(b) In an observational study we only look at what happened in thepast.

(c) Most experiments use random assignment while observationalstudies do not.

(d) Observational studies are completely useless since no causalinference can be made based on their findings.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 41 / 94

Page 94: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Experiments

Random assignment vs. random sampling

Random assignment

No random assignment

Random sampling

Causal conclusion, generalized to the whole

population.

No causal conclusion, correlation statement

generalized to the whole population.

Generalizability

No random sampling

Causal conclusion, only for the sample.

No causal conclusion, correlation statement only

for the sample.No

generalizability

Causation Correlation

ideal experiment

most experiments

most observational

studies

bad observational

studies

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 42 / 94

Page 95: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical dataScatterplots for paired dataDot plots and the meanHistograms and shapeVariance and standard deviationBox plots, quartiles, and the medianRobust statisticsTransforming dataMapping data

7 Considering categorical data

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 96: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Scatterplots for paired data

Scatterplot

Scatterplots are useful for visualizing the relationship between twonumerical variables.

Do life expectancy and total fertility appearto be associated or independent?

They appear to be linearly and negativelyassociated: as fertility increases, lifeexpectancy decreases.

Was the relationship the same throughoutthe years, or did it change?

The relationship changed over the years.

http:// www.gapminder.org/ world

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 43 / 94

Page 97: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Scatterplots for paired data

Scatterplot

Scatterplots are useful for visualizing the relationship between twonumerical variables.

Do life expectancy and total fertility appearto be associated or independent?

They appear to be linearly and negativelyassociated: as fertility increases, lifeexpectancy decreases.

Was the relationship the same throughoutthe years, or did it change?

The relationship changed over the years.

http:// www.gapminder.org/ world

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 43 / 94

Page 98: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Scatterplots for paired data

Scatterplot

Scatterplots are useful for visualizing the relationship between twonumerical variables.

Do life expectancy and total fertility appearto be associated or independent?

They appear to be linearly and negativelyassociated: as fertility increases, lifeexpectancy decreases.

Was the relationship the same throughoutthe years, or did it change?

The relationship changed over the years.

http:// www.gapminder.org/ world

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 43 / 94

Page 99: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Dot plots and the mean

Dot plots

Useful for visualizing one numerical variable. Darker colors representareas where there are more observations.

GPA

2.5 3.0 3.5 4.0

How would you describe the distribution of GPAs in this data set? Makesure to say something about the center, shape, and spread of the dis-tribution.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 44 / 94

Page 100: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Dot plots and the mean

Dot plots & mean

GPA

2.5 3.0 3.5 4.0

The mean, also called the average (marked with a triangle in theabove plot), is one way to measure the center of a distribution ofdata.

The mean GPA is 3.59.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 45 / 94

Page 101: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Dot plots and the mean

Mean

The sample mean, denoted as x̄, can be calculated as

x̄ =x1 + x2 + · · · + xn

n,

where x1, x2, · · · , xn represent the n observed values.

The population mean is also computed the same way but isdenoted as µ. It is often not possible to calculate µ sincepopulation data are rarely available.

The sample mean is a sample statistic, and serves as a pointestimate of the population mean. This estimate may not beperfect, but if the sample is good (representative of thepopulation), it is usually a pretty good estimate.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 46 / 94

Page 102: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Dot plots and the mean

Stacked dot plot

Higher bars represent areas where there are more observations,makes it a little easier to judge the center and the shape of thedistribution.

GPA

●● ● ●●●●

● ●●●

●●

●● ● ●●●

●●●

●●

●●

●●●

●●

●● ●

● ●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

● ●

● ● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●●●

●●

●●

● ●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

● ●●

●●●

●●

2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 47 / 94

Page 103: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Histograms - Extracurricular hours

Histograms provide a view of the data density. Higher barsrepresent where the data are relatively more common.Histograms are especially convenient for describing the shape ofthe data distribution.The chosen bin width can alter the story the histogram is telling.

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

50

100

150

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 48 / 94

Page 104: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Bin width

Which one(s) of these histograms are useful? Which reveal too muchabout the data? Which hide too much?

Hours / week spent on extracurricular activities

0 20 40 60 80 100

0

50

100

150

200

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

50

100

150

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

20

40

60

80

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

10

20

30

40

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 49 / 94

Page 105: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Shape of a distribution: modality

Does the histogram have a single prominent peak (unimodal), severalprominent peaks (bimodal/multimodal), or no apparent peaks(uniform)?

0 5 10 15

05

1015

0 5 10 15 20

05

1015

0 5 10 15 20

05

1015

20

0 5 10 15 20

02

46

810

14

Note: In order to determine modality, step back and imagine a smooth curve over the

histogram – imagine that the bars are wooden blocks and you drop a limp spaghetti

over them, the shape the spaghetti would take could be viewed as a smooth curve.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 50 / 94

Page 106: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Shape of a distribution: skewness

Is the histogram right skewed, left skewed, or symmetric?

0 2 4 6 8 10

05

1015

0 5 10 15 20 25

020

4060

0 20 40 60 80

05

1015

2025

30

Note: Histograms are said to be skewed to the side of the long tail.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 51 / 94

Page 107: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Shape of a distribution: unusual observations

Are there any unusual observations or potential outliers?

0 5 10 15 20

05

1015

2025

30

20 40 60 80 100

010

2030

40

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 52 / 94

Page 108: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Extracurricular activities

How would you describe the shape of the distribution of hours per weekstudents spend on extracurricular activities?

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

50

100

150

Unimodal and right skewed, with a potentially unusual observation at60 hours/week.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 53 / 94

Page 109: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Extracurricular activities

How would you describe the shape of the distribution of hours per weekstudents spend on extracurricular activities?

Hours / week spent on extracurricular activities

0 10 20 30 40 50 60 70

0

50

100

150

Unimodal and right skewed, with a potentially unusual observation at60 hours/week.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 53 / 94

Page 110: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 111: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal

bimodal multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 112: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal

multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 113: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodal

uniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 114: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 115: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 116: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew

left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 117: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew left skew

symmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 118: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal bimodal multimodaluniform

skewness

right skew left skewsymmetric

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 54 / 94

Page 119: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Practice

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 55 / 94

Page 120: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Practice

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 55 / 94

Page 121: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Application activity: Shapes of distributions

Sketch the expected distributions of the following variables:

number of piercings

scores on an exam

IQ scores

Come up with a concise way (1-2 sentences) to teach someone howto determine the expected distribution of any variable.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 56 / 94

Page 122: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Are you typical?

http:// www.youtube.com/ watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics ofa distribution?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 57 / 94

Page 123: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Histograms and shape

Are you typical?

http:// www.youtube.com/ watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics ofa distribution?

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 57 / 94

Page 124: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Variance

Variance is roughly the average squared deviation from the mean.

s2 =

∑ni=1(xi − x̄)2

n − 1

The sample mean is x̄ = 6.71,and the sample size isn = 217.

The variance of amount ofsleep students get per nightcan be calculated as:

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

s2 =(5 − 6.71)2 + (9 − 6.71)2 + · · · + (7 − 6.71)2

217 − 1= 4.11 hours2

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 58 / 94

Page 125: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Variance

Variance is roughly the average squared deviation from the mean.

s2 =

∑ni=1(xi − x̄)2

n − 1

The sample mean is x̄ = 6.71,and the sample size isn = 217.

The variance of amount ofsleep students get per nightcan be calculated as:

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

s2 =(5 − 6.71)2 + (9 − 6.71)2 + · · · + (7 − 6.71)2

217 − 1= 4.11 hours2

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 58 / 94

Page 126: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Variance

Variance is roughly the average squared deviation from the mean.

s2 =

∑ni=1(xi − x̄)2

n − 1

The sample mean is x̄ = 6.71,and the sample size isn = 217.

The variance of amount ofsleep students get per nightcan be calculated as:

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

s2 =(5 − 6.71)2 + (9 − 6.71)2 + · · · + (7 − 6.71)2

217 − 1= 4.11 hours2

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 58 / 94

Page 127: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant fromthe mean are weighed equally.

To weigh larger deviations more heavily.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 59 / 94

Page 128: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant fromthe mean are weighed equally.

To weigh larger deviations more heavily.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 59 / 94

Page 129: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Standard deviation

The standard deviation is the square root of the variance, and has thesame units as the data.s

s =√

s2

The standard deviation ofamount of sleep students getper night can be calculatedas:

s =√

4.11 = 2.03 hours

We can see that all of thedata are within 3 standarddeviations of the mean.

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 60 / 94

Page 130: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Standard deviation

The standard deviation is the square root of the variance, and has thesame units as the data.s

s =√

s2

The standard deviation ofamount of sleep students getper night can be calculatedas:

s =√

4.11 = 2.03 hours

We can see that all of thedata are within 3 standarddeviations of the mean.

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 60 / 94

Page 131: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Variance and standard deviation

Standard deviation

The standard deviation is the square root of the variance, and has thesame units as the data.s

s =√

s2

The standard deviation ofamount of sleep students getper night can be calculatedas:

s =√

4.11 = 2.03 hours

We can see that all of thedata are within 3 standarddeviations of the mean.

Hours of sleep / night

2 4 6 8 10 12

0

20

40

60

80

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 60 / 94

Page 132: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Median

The median is the value that splits the data in half when orderedin ascending order.

0, 1, 2, 3, 4

If there are an even number of observations, then the median isthe average of the two values in the middle.

0, 1, 2, 3, 4, 5→2 + 3

2= 2.5

Since the median is the midpoint of the data, 50% of the valuesare below it. Hence, it is also the 50th percentile.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 61 / 94

Page 133: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Q1, Q3, and IQR

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

Between Q1 and Q3 is the middle 50% of the data. Therange these data span is called the interquartile range, or the IQR.

IQR = Q3 − Q1

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 62 / 94

Page 134: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Box plot

The box in a box plot represents the middle 50% of the data, and thethick line in the box is the median.

# of

stu

dy h

ours

/ w

eek

0

10

20

30

40

50

60

70

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 63 / 94

Page 135: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Anatomy of a box plot#

of s

tudy

hou

rs /

wee

k

0

10

20

30

40

50

60

70

lower whisker

Q1 (first quartile)

median

Q3 (third quartile)

max whisker reach& upper whisker

suspected outliers

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 64 / 94

Page 136: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Whiskers and outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from thequartiles.

max upper whisker reach = Q3 + 1.5 × IQR

max lower whisker reach = Q1 − 1.5 × IQR

IQR : 20 − 10 = 10max upper whisker reach = 20 + 1.5 × 10 = 35max lower whisker reach = 10 − 1.5 × 10 = −5

A potential outlier is defined as an observation beyond themaximum reach of the whiskers. It is an observation thatappears extreme relative to the rest of the data.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 65 / 94

Page 137: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Whiskers and outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from thequartiles.

max upper whisker reach = Q3 + 1.5 × IQR

max lower whisker reach = Q1 − 1.5 × IQR

IQR : 20 − 10 = 10max upper whisker reach = 20 + 1.5 × 10 = 35max lower whisker reach = 10 − 1.5 × 10 = −5

A potential outlier is defined as an observation beyond themaximum reach of the whiskers. It is an observation thatappears extreme relative to the rest of the data.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 65 / 94

Page 138: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Whiskers and outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from thequartiles.

max upper whisker reach = Q3 + 1.5 × IQR

max lower whisker reach = Q1 − 1.5 × IQR

IQR : 20 − 10 = 10max upper whisker reach = 20 + 1.5 × 10 = 35max lower whisker reach = 10 − 1.5 × 10 = −5

A potential outlier is defined as an observation beyond themaximum reach of the whiskers. It is an observation thatappears extreme relative to the rest of the data.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 65 / 94

Page 139: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 66 / 94

Page 140: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Box plots, quartiles, and the median

Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 66 / 94

Page 141: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Extreme observations

How would sample statistics such as mean, median, SD, and IQR ofhousehold income be affected if the largest value was replaced with$10 million? What if the smallest value was replaced with $10 million?

Annual Household Income

●● ●●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

● ●●

●●

●●

●●

●●

●●

●● ●

●●

● ●

● ●●

●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

● ●

0e+00 2e+05 4e+05 6e+05 8e+05 1e+06

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 67 / 94

Page 142: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Robust statistics

Annual Household Income

●● ●●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

● ●●

●●

●●

●●

●●

●●

●● ●

●●

● ●

● ●●

●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

● ●

0e+00 2e+05 4e+05 6e+05 8e+05 1e+06

robust not robustscenario median IQR x̄ soriginal data 190K 200K 245K 226Kmove largest to $10 million 190K 200K 309K 853Kmove smallest to $10 million 200K 200K 316K 854K

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 68 / 94

Page 143: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Robust statistics

Median and IQR are more robust to skewness and outliers than meanand SD. Therefore,

for skewed distributions it is often more helpful to use medianand IQR to describe the center and spread

for symmetric distributions it is often more helpful to use themean and SD to describe the center and spread

If you would like to estimate the typical household income for a student,would you be more interested in the mean or median income?

Median

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 69 / 94

Page 144: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Robust statistics

Median and IQR are more robust to skewness and outliers than meanand SD. Therefore,

for skewed distributions it is often more helpful to use medianand IQR to describe the center and spread

for symmetric distributions it is often more helpful to use themean and SD to describe the center and spread

If you would like to estimate the typical household income for a student,would you be more interested in the mean or median income?

Median

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 69 / 94

Page 145: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Robust statistics

Median and IQR are more robust to skewness and outliers than meanand SD. Therefore,

for skewed distributions it is often more helpful to use medianand IQR to describe the center and spread

for symmetric distributions it is often more helpful to use themean and SD to describe the center and spread

If you would like to estimate the typical household income for a student,would you be more interested in the mean or median income?

Median

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 69 / 94

Page 146: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Mean vs. median

If the distribution is symmetric, center is often defined as themean: mean ≈ median

Symmetric

meanmedian

If the distribution is skewed or has extreme outliers, center isoften defined as the median

Right-skewed: mean > medianLeft-skewed: mean < median

Right−skewed

meanmedian

Left−skewed

meanmedian

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 70 / 94

Page 147: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Practice

Which is most likely true for the distribution of percentage of time actuallyspent taking notes in class versus on Facebook, Twitter, etc.?

% of time in class spent taking notes

0 20 40 60 80 100

0

10

20

30

40

50

(a) mean> median

(b) mean < median

(c) mean ≈ median

(d) impossible to tell

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 71 / 94

Page 148: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Robust statistics

Practice

Which is most likely true for the distribution of percentage of time actuallyspent taking notes in class versus on Facebook, Twitter, etc.?

% of time in class spent taking notes

0 20 40 60 80 100

0

10

20

30

40

50 median: 80%mean: 76%

(a) mean> median

(b) mean < median

(c) mean ≈ median

(d) impossible to tell

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 71 / 94

Page 149: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Transforming data

Extremely skewed data

When data are extremely skewed, transforming them might makemodeling easier. A common transformation is the log transformation.

The histograms on the left shows the distribution of number ofbasketball games attended by students. The histogram on the rightshows the distribution of log of number of games attended.

# of basketball games attended

0 10 20 30 40 50 60 70

0

50

100

150

# of basketball games attended

0 1 2 3 4

0

10

20

30

40

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 72 / 94

Page 150: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Transforming data

Extremely skewed data

When data are extremely skewed, transforming them might makemodeling easier. A common transformation is the log transformation.

The histograms on the left shows the distribution of number ofbasketball games attended by students. The histogram on the rightshows the distribution of log of number of games attended.

# of basketball games attended

0 10 20 30 40 50 60 70

0

50

100

150

# of basketball games attended

0 1 2 3 4

0

10

20

30

40

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 72 / 94

Page 151: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Transforming data

Pros and cons of transformations

Skewed data are easier to model with when they are transformedbecause outliers tend to become far less prominent after anappropriate transformation.

# of games 70 50 25 · · ·

log(# of games) 4.25 3.91 3.22 · · ·

However, results of an analysis might be difficult to interpretbecause the log of a measured variable is usually meaningless.

What other variables would you expect to be extremely skewed?

Salary, housing prices, etc.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 73 / 94

Page 152: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Transforming data

Pros and cons of transformations

Skewed data are easier to model with when they are transformedbecause outliers tend to become far less prominent after anappropriate transformation.

# of games 70 50 25 · · ·

log(# of games) 4.25 3.91 3.22 · · ·

However, results of an analysis might be difficult to interpretbecause the log of a measured variable is usually meaningless.

What other variables would you expect to be extremely skewed?

Salary, housing prices, etc.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 73 / 94

Page 153: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Transforming data

Pros and cons of transformations

Skewed data are easier to model with when they are transformedbecause outliers tend to become far less prominent after anappropriate transformation.

# of games 70 50 25 · · ·

log(# of games) 4.25 3.91 3.22 · · ·

However, results of an analysis might be difficult to interpretbecause the log of a measured variable is usually meaningless.

What other variables would you expect to be extremely skewed?

Salary, housing prices, etc.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 73 / 94

Page 154: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Examining numerical data Mapping data

Intensity maps

What patterns are apparent in the change in population between 2000and 2010?

http:// projects.nytimes.com/ census/ 2010/ map

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 74 / 94

Page 155: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical dataContingency tables and bar plotsRow and column proportionsSegmented bar and mosaic plotsPie chartsComparing numerical data across groups

8 Case study: Gender discrimination

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 156: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Contingency tables and bar plots

Contingency tables

A table that summarizes data for two categorical variables is called acontingency table.

The contingency table below shows the distribution of students’genders and whether or not they are looking for a spouse while incollege.

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 75 / 94

Page 157: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Contingency tables and bar plots

Contingency tables

A table that summarizes data for two categorical variables is called acontingency table.

The contingency table below shows the distribution of students’genders and whether or not they are looking for a spouse while incollege.

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 75 / 94

Page 158: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Contingency tables and bar plots

Bar plots

A bar plot is a common way to display a single categorical variable. Abar plot where proportions instead of frequencies are shown is calleda relative frequency bar plot.

Female Male0

20

40

60

80

100

120

Female Male0.0

0.1

0.2

0.3

0.4

0.5

0.6

How are bar plots different than histograms?

Bar plots are used for displaying distributions of categorical variables, while histograms are

used for numerical variables. The x-axis in a histogram is a number line, hence the order of the

bars cannot be changed, while in a bar plot the categories can be listed in any order (though

some orderings make more sense than others, especially for ordinal variables.)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 76 / 94

Page 159: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Contingency tables and bar plots

Bar plots

A bar plot is a common way to display a single categorical variable. Abar plot where proportions instead of frequencies are shown is calleda relative frequency bar plot.

Female Male0

20

40

60

80

100

120

Female Male0.0

0.1

0.2

0.3

0.4

0.5

0.6

How are bar plots different than histograms?

Bar plots are used for displaying distributions of categorical variables, while histograms are

used for numerical variables. The x-axis in a histogram is a number line, hence the order of the

bars cannot be changed, while in a bar plot the categories can be listed in any order (though

some orderings make more sense than others, especially for ordinal variables.)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 76 / 94

Page 160: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Contingency tables and bar plots

Bar plots

A bar plot is a common way to display a single categorical variable. Abar plot where proportions instead of frequencies are shown is calleda relative frequency bar plot.

Female Male0

20

40

60

80

100

120

Female Male0.0

0.1

0.2

0.3

0.4

0.5

0.6

How are bar plots different than histograms?

Bar plots are used for displaying distributions of categorical variables, while histograms are

used for numerical variables. The x-axis in a histogram is a number line, hence the order of the

bars cannot be changed, while in a bar plot the categories can be listed in any order (though

some orderings make more sense than others, especially for ordinal variables.)

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 76 / 94

Page 161: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Row and column proportions

Choosing the appropriate proportion

Does there appear to be a relationship between gender and whetherthe student is looking for a spouse in college?

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

To answer this question we examine the row proportions:

% Females looking for a spouse: 51/137 ≈ 0.37

% Males looking for a spouse: 18/70 ≈ 0.26

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 77 / 94

Page 162: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Row and column proportions

Choosing the appropriate proportion

Does there appear to be a relationship between gender and whetherthe student is looking for a spouse in college?

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

To answer this question we examine the row proportions:

% Females looking for a spouse: 51/137 ≈ 0.37

% Males looking for a spouse: 18/70 ≈ 0.26

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 77 / 94

Page 163: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Row and column proportions

Choosing the appropriate proportion

Does there appear to be a relationship between gender and whetherthe student is looking for a spouse in college?

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

To answer this question we examine the row proportions:

% Females looking for a spouse: 51/137 ≈ 0.37

% Males looking for a spouse: 18/70 ≈ 0.26

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 77 / 94

Page 164: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Row and column proportions

Choosing the appropriate proportion

Does there appear to be a relationship between gender and whetherthe student is looking for a spouse in college?

looking for spouseNo Yes Total

genderFemale 86 51 137Male 52 18 70Total 138 69 207

To answer this question we examine the row proportions:

% Females looking for a spouse: 51/137 ≈ 0.37

% Males looking for a spouse: 18/70 ≈ 0.26

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 77 / 94

Page 165: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Segmented bar and mosaic plots

Segmented bar and mosaic plots

What are the differences between the three visualizations shown be-low?

Female Male

YesNo

0

20

40

60

80

100

120

Female Male0.0

0.2

0.4

0.6

0.8

1.0 Female Male

No

Yes

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 78 / 94

Page 166: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Pie charts

Pie charts

Can you tell which order encompasses the lowest percentage of mam-mal species?

RODENTIACHIROPTERACARNIVORAARTIODACTYLAPRIMATESSORICOMORPHALAGOMORPHADIPROTODONTIADIDELPHIMORPHIACETACEADASYUROMORPHIAAFROSORICIDAERINACEOMORPHASCANDENTIAPERISSODACTYLAHYRACOIDEAPERAMELEMORPHIACINGULATAPILOSAMACROSCELIDEATUBULIDENTATAPHOLIDOTAMONOTREMATAPAUCITUBERCULATASIRENIAPROBOSCIDEADERMOPTERANOTORYCTEMORPHIAMICROBIOTHERIA

Data from http:// www.bucknell.edu/ msw3.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 79 / 94

Page 167: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Considering categorical data Comparing numerical data across groups

Side-by-side box plots

Does there appear to be a relationship between class year and numberof clubs students are in?

First−year Sophomore Junior Senior

0

2

4

6

8

●●

●●

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 80 / 94

Page 168: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination

1 Case study

2 Data basics

3 Overview of data collection principles

4 Observational studies and sampling strategies

5 Experiments

6 Examining numerical data

7 Considering categorical data

8 Case study: Gender discriminationStudy description and dataCompeting claimsTesting via simulationChecking for independence

OpenIntro Statistics, 2nd Edition

Chp 1: Intro. to data

Page 169: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Gender discrimination

In 1972, as a part of a study on gender discrimination, 48 malebank supervisors were each given the same personnel file andasked to judge whether the person should be promoted to abranch manager job that was described as “routine”.The files were identical except that half of the supervisors hadfiles showing the person was male while the other half had filesshowing the person was female.It was randomly determined which supervisors got “male”applications and which got “female” applications.Of the 48 files reviewed, 35 were promoted.The study is testing whether females are unfairly discriminatedagainst.

Is this an observational study or an experiment?

Experiment

B.Rosen and T. Jerdee (1974), “Influence of sex role stereotypes on personnel decisions”, J.Applied Psychology, 59:9-14.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 81 / 94

Page 170: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Gender discrimination

In 1972, as a part of a study on gender discrimination, 48 malebank supervisors were each given the same personnel file andasked to judge whether the person should be promoted to abranch manager job that was described as “routine”.The files were identical except that half of the supervisors hadfiles showing the person was male while the other half had filesshowing the person was female.It was randomly determined which supervisors got “male”applications and which got “female” applications.Of the 48 files reviewed, 35 were promoted.The study is testing whether females are unfairly discriminatedagainst.

Is this an observational study or an experiment?

ExperimentB.Rosen and T. Jerdee (1974), “Influence of sex role stereotypes on personnel decisions”, J.Applied Psychology, 59:9-14.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 81 / 94

Page 171: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Data

At a first glance, does there appear to be a relatonship between pro-motion and gender?

PromotionPromoted Not Promoted Total

GenderMale 21 3 24Female 14 10 24Total 35 13 48

% of males promoted: 21/24 = 0.875% of females promoted: 14/24 = 0.583

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 82 / 94

Page 172: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Data

At a first glance, does there appear to be a relatonship between pro-motion and gender?

PromotionPromoted Not Promoted Total

GenderMale 21 3 24Female 14 10 24Total 35 13 48

% of males promoted: 21/24 = 0.875% of females promoted: 14/24 = 0.583

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 82 / 94

Page 173: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Practice

We saw a difference of almost 30% (29.2% to be exact) between theproportion of male and female files that are promoted. Based on thisinformation, which of the below is true?

(a) If we were to repeat the experiment we will definitely see thatmore female files get promoted. This was a fluke.

(b) Promotion is dependent on gender, males are more likely to bepromoted, and hence there is gender discrimination againstwomen in promotion decisions.

(c) The difference in the proportions of promoted male and femalefiles is due to chance, this is not evidence of genderdiscrimination against women in promotion decisions.

(d) Women are less qualified than men, and this is why fewer femalesget promoted.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 83 / 94

Page 174: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Study description and data

Practice

We saw a difference of almost 30% (29.2% to be exact) between theproportion of male and female files that are promoted. Based on thisinformation, which of the below is true?

(a) If we were to repeat the experiment we will definitely see thatmore female files get promoted. This was a fluke.

(b) Promotion is dependent on gender, males are more likely to bepromoted, and hence there is gender discrimination againstwomen in promotion decisions. Maybe

(c) The difference in the proportions of promoted male and femalefiles is due to chance, this is not evidence of genderdiscrimination against women in promotion decisions. Maybe

(d) Women are less qualified than men, and this is why fewer femalesget promoted.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 83 / 94

Page 175: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

Two competing claims

1. “There is nothing going on.”Promotion and gender are independent, no genderdiscrimination, observed difference in proportions is simply dueto chance. → Null hypothesis

2. “There is something going on.”Promotion and gender are dependent, there is genderdiscrimination, observed difference in proportions is not due tochance. → Alternative hypothesis

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 84 / 94

Page 176: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

Two competing claims

1. “There is nothing going on.”Promotion and gender are independent, no genderdiscrimination, observed difference in proportions is simply dueto chance. → Null hypothesis

2. “There is something going on.”Promotion and gender are dependent, there is genderdiscrimination, observed difference in proportions is not due tochance. → Alternative hypothesis

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 84 / 94

Page 177: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

A trial as a hypothesis test

Hypothesis testing is verymuch like a court trial.

H0: Defendant is innocentHA: Defendant is guilty

We then present the evidence- collect data.

Then we judge the evidence - “Could these data plausibly havehappened by chance if the null hypothesis were true?”

If they were very unlikely to have occurred, then the evidenceraises more than a reasonable doubt in our minds about the nullhypothesis.

Ultimately we must make a decision. How unlikely is unlikely?

Image from http:// www.nwherald.com/ internal/ cimg!0/ oo1il4sf8zzaqbboq25oevvbg99wpot .

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 85 / 94

Page 178: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

A trial as a hypothesis test (cont.)

If the evidence is not strong enough to reject the assumption ofinnocence, the jury returns with a verdict of “not guilty”.

The jury does not say that the defendant is innocent, just thatthere is not enough evidence to convict.The defendant may, in fact, be innocent, but the jury has no wayof being sure.

Said statistically, we fail to reject the null hypothesis.We never declare the null hypothesis to be true, because wesimply do not know whether it’s true or not.Therefore we never “accept the null hypothesis”.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 86 / 94

Page 179: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

A trial as a hypothesis test (cont.)

In a trial, the burden of proof is on the prosecution.

In a hypothesis test, the burden of proof is on the unusual claim.

The null hypothesis is the ordinary state of affairs (the statusquo), so it’s the alternative hypothesis that we consider unusualand for which we must gather evidence.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 87 / 94

Page 180: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Competing claims

Recap: hypothesis testing framework

We start with a null hypothesis (H0) that represents the statusquo.

We also have an alternative hypothesis (HA) that represents ourresearch question, i.e. what we’re testing for.

We conduct a hypothesis test under the assumption that the nullhypothesis is true, either via simulation (today) or theoreticalmethods (later in the course).

If the test results suggest that the data do not provide convincingevidence for the alternative hypothesis, we stick with the nullhypothesis. If they do, then we reject the null hypothesis in favorof the alternative.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 88 / 94

Page 181: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Testing via simulation

Simulating the experiment...

... under the assumption of independence, i.e. leave things up tochance.

If results from the simulations based on the chance model look likethe data, then we can determine that the difference between theproportions of promoted files between males and females was simplydue to chance (promotion and gender are independent).

If the results from the simulations based on the chance model do notlook like the data, then we can determine that the difference betweenthe proportions of promoted files between males and females was notdue to chance, but due to an actual effect of gender (promotion andgender are dependent).

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 89 / 94

Page 182: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Testing via simulation

Application activity: simulating the experiment

Use a deck of playing cards to simulate this experiment.

1. Let a face card represent not promoted and a non-face cardrepresent a promoted. Consider aces as face cards.

Set aside the jokers.Take out 3 aces→ there are exactly 13 face cards left in the deck(face cards: A, K, Q, J).Take out a number card→ there are exactly 35 number(non-face) cards left in the deck (number cards: 2-10).

2. Shuffle the cards and deal them intro two groups of size 24,representing males and females.

3. Count and record how many files in each group are promoted(number cards).

4. Calculate the proportion of promoted files in each group and takethe difference (male - female), and record this value.

5. Repeat steps 2 - 4 many times.OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 90 / 94

Page 183: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Testing via simulation

Step 1

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 91 / 94

Page 184: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Testing via simulation

Step 2 - 4

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 92 / 94

Page 185: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Checking for independence

Practice

Do the results of the simulation you just ran provide convincing evi-dence of gender discrimination against women, i.e. dependence be-tween gender and promotion decisions?

(a) No, the data do not provide convincing evidence for thealternative hypothesis, therefore we can’t reject the nullhypothesis of independence between gender and promotiondecisions. The observed difference between the two proportionswas due to chance.

(b) Yes, the data provide convincing evidence for the alternativehypothesis of gender discrimination against women in promotiondecisions. The observed difference between the two proportionswas due to a real effect of gender.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 93 / 94

Page 186: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Checking for independence

Practice

Do the results of the simulation you just ran provide convincing evi-dence of gender discrimination against women, i.e. dependence be-tween gender and promotion decisions?

(a) No, the data do not provide convincing evidence for thealternative hypothesis, therefore we can’t reject the nullhypothesis of independence between gender and promotiondecisions. The observed difference between the two proportionswas due to chance.

(b) Yes, the data provide convincing evidence for the alternativehypothesis of gender discrimination against women in promotiondecisions. The observed difference between the two proportionswas due to a real effect of gender.

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 93 / 94

Page 187: Chapter 1: Introduction to datasite.iugaza.edu.ps › mriffi › files › 2016 › 02 › os2_slides_01.pdf · outcomes at 6-month follow-up. Note that 7 patients dropped out of

Case study: Gender discrimination Checking for independence

Simulations using software

These simulations are tedious and slow to run using the methoddescribed earlier. In reality, we use software to generate thesimulations. The dot plot below shows the distribution of simulateddifferences in promotion rates based on 100 simulations.

●●

● ●

●●

Difference in promotion rates

−0.4 −0.2 0 0.2 0.4

OpenIntro Statistics, 2nd Edition Chp 1: Intro. to data 94 / 94