
9/14/15

Statistics: Unlocking the Power of Data, Lock5

STAT 250
Dr. Kari Lock Morgan

Describing Data: Two Variables

SECTIONS 2.4, 2.5
• One quantitative variable (2.4)
• One quantitative and one categorical (2.4)
• Two quantitative (2.5)

z-score

Which is better, an ACT score of 28 or a combined SAT score of 2100?
• ACT: μ = 21, σ = 5
• SAT: μ = 1500, σ = 325
• Assume ACT and SAT scores have approximately bell-shaped distributions

a) ACT score of 28
b) SAT score of 2100
c) I don’t know
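
One way to compare the two scores is to put each on a common scale with the standard z-score formula, z = (x - μ) / σ. The sketch below is a minimal illustration in Python; the function name `z_score` is a hypothetical helper, and only the means and standard deviations come from the slide.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / sd

# Values taken from the slide
z_act = z_score(28, mean=21, sd=5)        # ACT: mu = 21, sigma = 5
z_sat = z_score(2100, mean=1500, sd=325)  # SAT: mu = 1500, sigma = 325

print(f"ACT z-score: {z_act:.2f}")
print(f"SAT z-score: {z_sat:.2f}")
```

The score with the larger z-score sits further above its own mean, relative to the spread of that test.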

 

 

Honeybee Waggle Dances

https://www.youtube.com/watch?v=-7ijI-g4jHg

Honeybee Waggle Dance

• Honeybee scouts investigate new home or food source options; the scouts communicate the information to the hive with a “waggle dance”

• Scientists took bees to an island with only two possible options for nesting: one of very high quality and one of low quality. They recorded:
  ◦ Quality of nesting site
  ◦ Distance to nesting site
  ◦ Number of waggle dance circuits performed
  ◦ Duration of waggle dance

Seeley, T., Honeybee Democracy, Princeton University Press, Princeton, NJ, 2010, p. 128

Questions of the Day

How many circuits of the waggle dance do honey bees do?

How is this related to quality of a nesting site?

How is duration of the dance related to distance to a nesting site?

Other Measures of Location

Maximum = largest data value

Minimum = smallest data value

Quartiles (m denotes the median):
Q1 = median of the values below m
Q3 = median of the values above m

Five Number Summary

• Five Number Summary: Min, Q1, m, Q3, Max (each of the four intervals between consecutive values contains 25% of the data)

Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics
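
As a rough illustration of the definitions above (Q1 and Q3 as medians of the lower and upper halves), here is a minimal Python sketch. The data values are hypothetical, not the honeybee data, and `five_number_summary` is an illustrative helper. Software such as Minitab may use a slightly different quartile convention, so its output can differ a little.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, m, Q3, Max), with quartiles as medians of each half."""
    data = sorted(values)
    n = len(data)
    m = median(data)
    lower = data[: n // 2]        # values below m (middle value excluded when n is odd)
    upper = data[(n + 1) // 2 :]  # values above m
    return data[0], median(lower), m, median(upper), data[-1]

# Hypothetical numbers of circuits, for illustration only
circuits = [0, 5, 12, 18, 25, 30, 45, 70, 110]
print(five_number_summary(circuits))
```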

Five Number Summary

The distribution of number of circuits is

a) Symmetric
b) Right-skewed
c) Left-skewed
d) Impossible to tell

Percentile

The Pth percentile is the value which is greater than P% of the data

• We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better

• We could also have used percentiles:
  ◦ ACT score of 28: 91st percentile
  ◦ SAT score of 2100: 97th percentile
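
The percentiles quoted above can be roughly checked under the bell-shaped assumption from the z-score slide. The sketch below assumes the scores are exactly normal, which is only an approximation; the slide's figures may come from a different source, so small differences are expected.

```python
from statistics import NormalDist

# Normal models using the means and standard deviations from the z-score slide
act = NormalDist(mu=21, sigma=5)
sat = NormalDist(mu=1500, sigma=325)

# cdf(x) is the proportion of the distribution falling below x
p_act = act.cdf(28)
p_sat = sat.cdf(2100)

print(f"ACT 28:   about {100 * p_act:.1f}% of scores fall below")
print(f"SAT 2100: about {100 * p_sat:.1f}% of scores fall below")
```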

Five Number Summary

• Five Number Summary:
  ◦ Min = 0th percentile
  ◦ Q1 = 25th percentile
  ◦ m = 50th percentile
  ◦ Q3 = 75th percentile
  ◦ Max = 100th percentile

Each of the four intervals between consecutive values contains 25% of the data.

Measures of Spread

• Range = Max - Min

• Interquartile Range (IQR) = Q3 - Q1

• Is the range resistant to outliers?
  a) Yes
  b) No

• Is the IQR resistant to outliers?
  a) Yes
  b) No

Comparing Statistics

• Measures of Center:
  ◦ Mean (not resistant)
  ◦ Median (resistant)

• Measures of Spread:
  ◦ Standard deviation (not resistant)
  ◦ IQR (resistant)
  ◦ Range (not resistant)

• Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information
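
To see what "resistant" means in practice, the sketch below replaces the largest value in a small hypothetical data set with an extreme outlier and recomputes each statistic. The data and the `iqr` helper (quartiles as medians of each half) are assumptions for illustration only.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range, with Q1 and Q3 taken as medians of each half."""
    data = sorted(values)
    n = len(data)
    return median(data[(n + 1) // 2 :]) - median(data[: n // 2])

data = [10, 12, 13, 15, 16, 18, 20, 21]   # hypothetical values
with_outlier = data[:-1] + [210]          # largest value replaced by an outlier

for label, d in [("original", data), ("with outlier", with_outlier)]:
    print(
        f"{label:>12}: mean={mean(d):.2f} median={median(d)} "
        f"sd={stdev(d):.2f} IQR={iqr(d)} range={max(d) - min(d)}"
    )
```

In this example the median and IQR are unchanged by the outlier, while the mean, standard deviation, and range all shift substantially.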

Boxplot

[Figure: boxplot with labels Q1, Median, Q3, “Middle 50% of data”, and two points marked “Outlier”]

Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier

Minitab: Graph -> Boxplot -> One Y -> Simple

*For boxplots, outliers are defined as any point more than 1.5 IQRs beyond the quartiles (although you don’t have to know that)
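
The 1.5 IQR rule in the footnote can be sketched as follows. This is a minimal illustration on hypothetical data; `boxplot_outliers` is an illustrative helper, and the quartiles use the median-of-halves convention, which may differ slightly from what Minitab computes.

```python
from statistics import median

def boxplot_outliers(values):
    """Return (outliers, whisker_ends) using the 1.5 * IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    inside = [x for x in data if low_fence <= x <= high_fence]
    # Whiskers end at the most extreme values that are not outliers
    return outliers, (inside[0], inside[-1])

# Hypothetical numbers of circuits, for illustration only
circuits = [0, 5, 12, 18, 25, 30, 45, 70, 250]
outliers, whiskers = boxplot_outliers(circuits)
print("outliers:", outliers)
print("whiskers end at:", whiskers)
```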

Boxplot

This boxplot shows a distribution that is

a) Symmetric
b) Left-skewed
c) Right-skewed

One Quantitative and One Categorical

• How is number of waggle circuits related to the quality of the nesting site?

• Two variables
  ◦ One quantitative (number of circuits)
  ◦ One categorical (quality: low or high)

• Can do anything for one quantitative variable, broken down by categorical groups

Side-by-Side Boxplots

Minitab: Graph -> Boxplot -> One Y -> With Groups

Stacked Dotplots

Minitab: Graph -> Dotplot -> One Y -> With Groups

Overlaid Histograms

Minitab: Graph -> Histogram -> With Groups

Quantitative Statistics by a Categorical Variable

• Any of the statistics we use for a quantitative variable can be looked at separately for each level of a categorical variable

Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics -> By variables

Difference in Means

• Often, when comparing a quantitative variable across two categories, we compute the difference in means

$\bar{x}_H - \bar{x}_L = 90.5 - 30 = 60.5$

Honeybees perform 60.5 circuits more, on average, for the high quality site as opposed to the low quality site.
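
A difference in means can be computed by splitting the quantitative variable by group and subtracting the group means. The sketch below uses a few made-up (quality, circuits) records, not the actual honeybee data, so its result will not match the 60.5 above.

```python
from statistics import mean

# Hypothetical (quality, circuits) records, for illustration only
records = [
    ("high", 80), ("high", 100), ("high", 90),
    ("low", 20), ("low", 40), ("low", 30),
]

# Collect the quantitative values separately for each category
by_quality = {}
for quality, circuits in records:
    by_quality.setdefault(quality, []).append(circuits)

group_means = {quality: mean(values) for quality, values in by_quality.items()}
diff = group_means["high"] - group_means["low"]

print("group means:", group_means)
print("difference in means (high - low):", diff)
```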

 

 

Association?

Does there appear to be an association between number of waggle circuits and quality of potential nesting site?

a) Yes
b) No

Summary: One Quantitative and One Categorical

• Summary Statistics
  ◦ Any summary statistics for quantitative variables, broken down by groups
  ◦ Difference in means

• Visualization
  ◦ Side-by-side graphs

Two Quantitative Variables

• How is duration of the dance related to distance to a nesting site?

• Two quantitative variables

• Summary Statistics: correlation
• Visualization: scatterplot

Scatterplot

A scatterplot is the graph of the relationship between two quantitative variables.

Minitab: Graph -> Scatterplot -> Simple

Direction of Association

• A positive association means that values of one variable tend to be higher when values of the other variable are higher
• A negative association means that values of one variable tend to be lower when values of the other variable are higher
• Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable

Correlation

The correlation is a measure of the strength and direction of linear association between two quantitative variables

• Sample correlation: r
• Population correlation: ρ (“rho”)

Minitab: Stat -> Basic Statistics -> Correlation

r = 0.994 for duration of dance and distance to site

Correlation

1. -1 ≤ r ≤ 1
2. The sign indicates the direction of association
   ◦ positive association: r > 0
   ◦ negative association: r < 0
   ◦ no linear association: r ≈ 0
3. The closer r is to ±1, the stronger the linear association
4. r has no units and does not depend on the units of measurement
5. The correlation between X and Y is the same as the correlation between Y and X
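
Several of these properties can be checked numerically. The sketch below computes the sample (Pearson) correlation by hand on hypothetical distance and duration values, not the actual waggle dance data; the `correlation` helper and variable names are illustrative assumptions.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample (Pearson) correlation r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, for illustration only
distance_m = [200, 400, 800, 1200, 2000]
duration_s = [0.4, 0.9, 1.6, 2.5, 4.1]

r = correlation(distance_m, duration_s)
r_swapped = correlation(duration_s, distance_m)   # property 5: X and Y swapped
distance_km = [d / 1000 for d in distance_m]
r_km = correlation(distance_km, duration_s)       # property 4: units changed

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, in km = {r_km:.4f}")
```

Swapping the variables or changing meters to kilometers leaves r the same, up to floating-point rounding.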

 

Correlation Guessing Game

http://www.istics.net/Correlations/

Enter PennState for the group ID. Highest scorer in the class by the first exam gets one extra credit point on Exam 1!

Correlation

[Figure: scatterplot for NFL teams. x-axis: Malevolence Rating of Uniform; y-axis: z-score for Penalty Yards; r = 0.43]

Correlation Cautions

1. Correlation can be heavily affected by outliers. Always plot your data!

Testosterone Levels and Time

What is the correlation between testosterone levels and hour of the day?
a) Positive
b) Negative
c) About 0

Are testosterone level and hour of the day associated?
a) Yes
b) No

Correlation Cautions

1. Correlation can be heavily affected by outliers. Always plot your data!

2. r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!

TVs and Life Expectancy

[Figure: scatterplot “TV and Life Expectancy”, points labeled by country. x-axis: TVs per 1000 People; y-axis: Life Expectancy; r = 0.74]

Correlation Cautions

1. Correlation can be heavily affected by outliers. Always plot your data!

2. r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!

3. Correlation does not imply causation!
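
The first two cautions can be illustrated with small made-up data sets: one where a single outlier drives the correlation, and one where y is completely determined by x yet r is 0. This is a minimal sketch; the data and the `correlation` helper are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample (Pearson) correlation r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Caution 1: a single outlier can dominate r (hypothetical data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 1, 5, 3]
r_without = correlation(xs, ys)
r_with = correlation(xs + [30], ys + [30])   # add one extreme point

# Caution 2: y = x^2 is a perfect (nonlinear) relationship with r = 0
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
print(f"y = x^2:         r = {r_curve:.2f}")
```

A scatterplot of either data set would reveal the issue immediately, which is why the slides repeat "Always plot your data!"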

 

 

 

 

 

Summary: Two Quantitative Variables

• Summary Statistics: correlation
• Visualization: scatterplot

To Do

• Read Sections 2.4 and 2.5

• Do HW 2.2, 2.3, 2.4, 2.5 (due Friday, 9/18)