Chapter 4 Displaying and Summarizing Quantitative Data CHAPTER OBJECTIVES At the conclusion of this...

Chapter 4 Displaying and Summarizing Quantitative Data CHAPTER OBJECTIVES At the conclusion of this chapter you should be able to: 1) Construct graphs that appropriately describe quantitative data 2) Calculate and interpret numerical summaries of quantitative data. 3) Combine numerical methods with graphical methods to analyze a data set. 4) Apply graphical methods of summarizing data to choose appropriate numerical summaries. 5) Apply software and/or calculators to automate graphical and numerical summary procedures.


  • Slide 1
  • Slide 2
  • Chapter 4: Displaying and Summarizing Quantitative Data. CHAPTER OBJECTIVES: At the conclusion of this chapter you should be able to:
      1) Construct graphs that appropriately describe quantitative data.
      2) Calculate and interpret numerical summaries of quantitative data.
      3) Combine numerical methods with graphical methods to analyze a data set.
      4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
      5) Apply software and/or calculators to automate graphical and numerical summary procedures.
  • Slide 3
  • Displaying Quantitative Data: Histograms; Stem and Leaf Displays
  • Slide 4
  • [Figure: Relative Frequency Histogram of Exam Grades. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]
  • Slide 5
  • Frequency Histogram
  • Slide 6
  • Histograms: A histogram shows three general types of information:
      • It provides visual indication of where the approximate center of the data is.
      • We can gain an understanding of the degree of spread, or variation, in the data.
      • We can observe the shape of the distribution.
  • Slide 7
  • All 200 m Races 20.2 secs or less
  • Slide 8
  • Histograms Showing Different Centers
  • Slide 9
  • Histograms Showing Different Centers (football head coach salaries)
  • Slide 10
  • Histograms - Same Center, Different Spread (football head coach salaries)
  • Slide 11
  • Excel Example: 2012-13 NFL Salaries
  • Slide 12
  • Statcrunch Example: 2012-13 NFL Salaries
  • Slide 13
  • Grades on a statistics exam. Data: 75 66 77 66 64 73 91 65 59 86 61 86 61 58 70 77 80 58 94 78 62 79 83 54 52 45 82 48 67 55
  • Slide 14
  • Frequency Distribution of Grades
      Class Limits     Frequency
      40 up to 50          2
      50 up to 60          6
      60 up to 70          8
      70 up to 80          7
      80 up to 90          5
      90 up to 100         2
      Total               30
  • Slide 15
  • Relative Frequency Distribution of Grades
      Class Limits     Relative Frequency
      40 up to 50      2/30 = .067
      50 up to 60      6/30 = .200
      60 up to 70      8/30 = .267
      70 up to 80      7/30 = .233
      80 up to 90      5/30 = .167
      90 up to 100     2/30 = .067
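
The frequency and relative frequency tables above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `frequency_table` and its parameters are assumptions made for illustration, and each class is treated as "lower up to (but not including) upper", which matches the counts on the slides.

```python
from collections import Counter

# Exam grades from the slide "Grades on a statistics exam"
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61, 58, 70,
          77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45, 82, 48, 67, 55]

def frequency_table(data, start=40, stop=100, width=10):
    """Return (lower, upper, frequency, relative frequency) per class.

    Hypothetical helper: a class "40 up to 50" holds values x with
    40 <= x < 50.
    """
    # Map each value to the index of the class it falls in
    counts = Counter((x - start) // width for x in data if start <= x < stop)
    table = []
    for i in range((stop - start) // width):
        lower = start + i * width
        freq = counts.get(i, 0)
        table.append((lower, lower + width, freq, freq / len(data)))
    return table

for lower, upper, freq, rel in frequency_table(grades):
    print(f"{lower} up to {upper}: {freq:2d}  {rel:.3f}")
```

The relative frequencies are the bar heights of the relative frequency histogram, so they sum to 1.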
  • Slide 16
  • [Figure: Relative Frequency Histogram of Grades. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30)]
  • Slide 17
  • Based on the histogram, about what percent of the values are between 47.5 and 52.5? 1. 50% 2. 5% 3. 17% 4. 30%
  • Slide 18
  • Stem and leaf displays have the following general appearance:
      stem | leaf
         1 | 8 9
         2 | 1 2 8 9 9
         3 | 2 3 8 9
         4 | 0 1
         5 | 6 7
         6 | 4
  • Slide 19
  • Stem and Leaf Displays: Partition each number in the data into a stem and a leaf. Constructing a stem and leaf display:
      1) determine the stem and leaf partition (5-20 stems)
      2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
      3) use only 1 digit in the leaves; drop digits or round off
      4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps
  • Slide 20
  • Example: employee ages at a small company: 18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39. Stem: 10s digit; leaf: 1s digit. For 18: stem = 1; leaf = 8; 18 = 1 | 8
      stem | leaf
         1 | 8 9
         2 | 1 2 8 9 9
         3 | 2 3 8 9
         4 | 0 1
         5 | 6 7
         6 | 4
  • Slide 21
  • Suppose a 95 yr. old is hired:
      stem | leaf
         1 | 8 9
         2 | 1 2 8 9 9
         3 | 2 3 8 9
         4 | 0 1
         5 | 6 7
         6 | 4
         7 |
         8 |
         9 | 5
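
The construction steps and the two employee-age displays above can be sketched in Python. This is an illustrative sketch only: the function name `stem_and_leaf` is an assumption, and it handles just the case used on the slides (whole numbers, stem = 10s digit and above, leaf = 1s digit).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf display as strings.

    Hypothetical helper for whole numbers: stem = value // 10,
    leaf = value % 10.
    """
    rows = defaultdict(list)
    for v in sorted(values):          # sorting orders the leaves in each row
        rows[v // 10].append(v % 10)
    lines = []
    # Include every stem in the range of the data, even the empty ones
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
print("\n".join(stem_and_leaf(ages)))
print()
print("\n".join(stem_and_leaf(ages + [95])))   # after the 95-year-old is hired
```

The second display shows why all stems in the range are listed: the empty rows for stems 7 and 8 make the gap between 64 and 95 visible.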
  • Slide 22
  • Number of TD passes by NFL teams: 2012-2013 season (stems are 10s digit; key: 4 | 3 = 43)
      stem | leaf
         4 | 0 3
         3 | 2 4 7
         2 | 6 6 7 7 7 8 9
         2 | 0 1 2 2 2 2 3 3 4 4 4
         1 | 1 3 4 6 7 8 8 9
         0 | 8
  • Slide 23
  • Pulse Rates n = 138
  • Slide 24
  • Advantages/Disadvantages of Stem-and-Leaf Displays
      • Advantages: 1) each measurement displayed 2) ascending order in each stem row 3) relatively simple (data set not too large)
      • Disadvantages: display becomes unwieldy for large data sets
  • Slide 25
  • Populations of 185 US cities with between 100,000 and 500,000 residents (multiply stems by 100,000)
  • Slide 26
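  • Back-to-back stem-and-leaf displays. TD passes by NFL teams: 1999-2000, 2012-13 (multiply stems by 10)
                  1999-2000 | stem | 2012-13
                          2 |  4   | 0 3
                          6 |  3   | 7
                          2 |  3   | 2 4
                    6 6 5 5 |  2   | 6 6 7 7 7 8 9
      4 3 3 2 2 2 2 1 1 0 0 |  2   | 0 1 2 2 2 2 3 3 4 4 4
        9 9 9 8 8 8 7 6 6 6 |  1   | 6 7 8 8 9
                      4 2 1 |  1   | 1 3 4
                            |  0   | 8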
  • Slide 27
  • Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic (stems are 10s digits). How many pulses are between 67 and 77? 1. 4 2. 6 3. 8 4. 10 5. 12
  • Slide 28
  • Interpreting Graphical Displays: Shape
      • Symmetric distribution: a distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
      • Skewed distribution: a distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
      • Complex, multimodal distribution: not all distributions have a simple overall shape, especially when there are few observations.
  • Slide 29
  • Heights of Students in Recent Stats Class
  • Slide 30
  • Shape (cont.): Female heart attack patients in New York state. Age: left-skewed. Cost: right-skewed.
  • Slide 31
  • Shape (cont.): Outliers. An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them. The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population. A large gap in the distribution is typically a sign of an outlier. [Figure labels: Alaska, Florida]
  • Slide 32
  • Center: typical value of frozen personal pizza? ~$2.65
  • Slide 33
  • Spread: fuel efficiency, 4 and 8 cylinders. 4 cylinders: more spread; 8 cylinders: less spread.
  • Slide 34
  • Other Graphical Methods for Economic Data
      • Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.
      • Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
  • Slide 35
  • Heat Maps
  • Slide 36
  • Unemployment Rate, by Educational Attainment
  • Slide 37
  • Water Use During Super Bowl
  • Slide 38
  • Winning Times 100 M Dash
  • Slide 39
  • Numerical Summaries of Quantitative Data: Numerical and More Graphical Methods to Describe Univariate Data
  • Slide 40
  • 2 characteristics of a data set to measure:
      • center: measures where the middle of the data is located
      • variability: measures how spread out the data is
  • Slide 41
  • The median: a measure of center. Given a set of n measurements arranged in order of magnitude, the median is the middle value if n is odd, and the mean of the 2 middle values if n is even.
      • Ex. 2, 4, 6, 8, 10; n=5; median=6
      • Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5
  • Slide 42
  • Student Pulse Rates (n=62) 38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103 Median = (75+76)/2 = 75.5
  • Slide 43
  • Medians are used often
      • Year 2014 baseball salaries: median $1,450,000 (max=$28,000,000 Zack Greinke; min=$500,000)
      • Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
      • Median existing home sales price: May 2011 $166,500; May 2010 $174,600
      • Median household income (2008 dollars): 2009 $50,221; 2008 $52,029
  • Slide 44
  • The median splits the histogram into 2 halves of equal area
  • Slide 45
  • Examples
      • Example: n = 7: 17.5 2.8 3.2 13.9 14.1 25.3 45.8
      • Example n = 7 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 45.8; m = 14.1
      • Example: n = 8: 17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
      • Example n = 8 (ordered): 2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8; m = (14.1+17.5)/2 = 15.8
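
The median rule and the examples above can be checked with a short Python sketch. The function below is written out by hand to mirror the slide's definition (sort, then take the middle value or the mean of the two middle values); Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Median per the slide's rule: middle value for odd n,
    mean of the 2 middle values for even n."""
    ordered = sorted(values)          # arrange in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))                                   # n = 5
print(median([2, 4, 6, 8]))                                       # n = 4
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]))           # n = 7
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8]))     # n = 8
```

Note that the data must be sorted before picking the middle position; taking the middle of the unordered list is a common mistake.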
  • Slide 46
  • Below are the annual tuition charges at 7 public universities. What is the median tuition? 4429 4960 4971 5245 5546 7586 1. 5245 2. 4965.5 3. 4960 4. 4971
  • Slide 47
  • Below are the annual tuition charges at 7 public universities. What is the median tuition? 4429 4960 5245 5546 4971 5587 7586 1. 5245 2. 4965.5 3. 5546 4. 4971
  • Slide 48
  • Measures of Spread: The range and interquartile range
  • Slide 49
  • Ways to measure variability: range = largest - smallest. OK sometimes; in general, too crude; sensitive to one large or small data value. The range measures spread by examining the ends of the data. A better way to measure spread is to examine the middle portion of the data.
  • Slide 50
  • Quartiles: Measuring spread by examining the middle. The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data). The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data). [Figure values: m = median = 3.4; Q1 = first quartile = 2.3; Q3 = third quartile = 4.2]
  • Slide 51
  • Quartiles and median divide data into 4 pieces [Figure: two distributions, each marked at Q1, M, and Q3, with 1/4 of the data in each of the 4 pieces]
  • Slide 52
  • Quartiles are common measures of spread
      • http://oirp.ncsu.edu/ir/admit
      • http://oirp.ncsu.edu/univ/peer
      • University of Southern California
      • Economic Value of College Majors
  • Slide 53
  • Rules for Calculating Quartiles
      Step 1: find the median of all the data (the median divides the data in half).
      Step 2a: find the median of the lower half; this median is Q1.
      Step 2b: find the median of the upper half; this median is Q3.
      Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.
  • Slide 54
  • Example: 2 4 6 8 10 12 14 16 18 20; n = 10
      • Median: m = (10+12)/2 = 22/2 = 11
      • Q1: median of lower half 2 4 6 8 10; Q1 = 6
      • Q3: median of upper half 12 14 16 18 20; Q3 = 16
  • Slide 55
  • Quartile example: odd no. of data values. HRs hit by Babe Ruth in each season as a Yankee: 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
      Ordered values: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
      Median: value in ordered position 8; median = 46
      Lower half (including overall median): 22 25 34 35 41 41 46 46
      Upper half (including overall median): 46 46 47 49 54 54 59 60
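
The quartile rules and the two worked examples above (n = 10 and the Babe Ruth home run data) can be sketched in Python as follows. The function names are assumptions made for illustration. The sketch follows the slides' convention (include the overall median in both halves when n is odd); statistical software may use other quartile conventions and can return slightly different values.

```python
def median(values):
    """Middle value for odd n, mean of the 2 middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the slides' rule for the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

even_example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
ruth_hrs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

print(quartiles(even_example))
print(quartiles(ruth_hrs))
```

For the Babe Ruth data each half has 8 values, so Q1 and Q3 are each the mean of the two middle values of their half.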
  • Slide 56
  • Pulse Rates, n = 138
      Median: mean of pulses in locations 69 & 70: median = (70+70)/2 = 70
      Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
      Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
  • Slide 57
  • Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1? 1. 287 2. 257.5 3. 263.5 4. 262.5
        # | stem | leaf
        2 |  22  | 5 5
        4 |  23  | 5 7
        6 |  24  | 2 6
        7 |  25  | 7
       10 |  26  | 2 5 7
       12 |  27  | 5 9
      (4) |  28  | 1 5 6 7
       15 |  29  | 3 5 5 9 9
       10 |  30  | 3 3 3
        7 |  31  | 4 5
        5 |  32  | 1 5 5
        2 |  33  | 6
        1 |  34  | 0
  • Slide 58
  • Interquartile range
      • lower quartile: Q1
      • middle quartile: median
      • upper quartile: Q3
      • interquartile range (IQR): IQR = Q3 - Q1; measures spread of middle 50% of the data
  • Slide 59
  • Example: beginning pulse rates. Q3 = 78; Q1 = 63; IQR = 78 - 63 = 15
  • Slide 60
  • Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR? 1. 23.5 2. 39.5 3. 46 4. 69.5
        # | stem | leaf
        2 |  22  | 5 5
        4 |  23  | 5 7
        6 |  24  | 2 6
        7 |  25  | 7
       10 |  26  | 2 5 7
       12 |  27  | 5 9
      (4) |  28  | 1 5 6 7
       15 |  29  | 3 5 5 9 9
       10 |  30  | 3 3 3
        7 |  31  | 4 5
        5 |  32  | 1 5 5
        2 |  33  | 6
        1 |  34  | 0
  • Slide 61
  • 5-number summary of data
      • Minimum, Q1, median, Q3, maximum
      • Pulse data: 45, 63, 70, 78, 111
  • Slide 62
  • Boxplot: display of 5-number summary. Five-number summary: min, Q1, m, Q3, max. [Figure: boxplot with Smallest = min = 0.6; Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2; Largest = max = 6.1]
  • Slide 63
  • Boxplot: display of 5-number summary. Example: age of 66 crush victims at rock concerts 1999-2000. 5-number summary: 13 17 19 22 47
  • Slide 64
  • Boxplot construction
      1) construct a box with ends located at Q1 and Q3; in the box mark the location of the median (usually with a line or a +)
      2) fences are determined by moving a distance 1.5(IQR) from each end of the box:
         2a) upper fence is 1.5*IQR above the upper quartile
         2b) lower fence is 1.5*IQR below the lower quartile
      Note: the fences only help with constructing the boxplot; they do not appear in the final boxplot display
  • Slide 65
  • Box plot construction (cont.)
      3) whiskers: draw lines from the ends of the box left and right to the most extreme data values found within the fences
      4) outliers: special symbols represent each data value beyond the fences
         4a) sometimes a different symbol is used for far outliers that are more than 3 IQRs from the quartiles
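
The four construction steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names and the returned dictionary keys are hypothetical, the quartiles follow the rule from the "Rules for Calculating Quartiles" slide, and the data are the 62 student pulse rates listed on the earlier median slide.

```python
def median(values):
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """(Q1, median, Q3); when n is odd the median is included in both halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

def boxplot_stats(values, k=1.5):
    """Box ends, fences, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    q1, med, q3 = quartiles(ordered)           # step 1: the box and median
    iqr = q3 - q1
    lower_fence = q1 - k * iqr                 # step 2: fences (not drawn)
    upper_fence = q3 + k * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),   # step 3: most extreme values within fences
        "outliers": outliers,                  # step 4: values beyond the fences
    }

# Student pulse rates (n = 62) from the median slide
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

print(boxplot_stats(pulses))
```

With these data the whiskers stop at the most extreme values inside the fences rather than at the minimum and maximum, and the pulse of 38 is flagged as a suspected outlier.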
  • Slide 66
  • Boxplot: display of 5-number summary. Q1 = first quartile = 2.3; Q3 = third quartile = 4.2; Largest = max = 7.9.
      Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9
      Distance to Q3: 7.9 - 4.2 = 3.7
      1.5 * IQR = 1.5 * 1.9 = 2.85
      Individual #25 has a value of 7.9 years, which is 3.7 years above the third quartile. This is more than 2.85 = 1.5*IQR above Q3. Thus, individual #25 is a suspected outlier.
  • Slide 67
  • ATM Withdrawals by Day, Month, Holidays
  • Slide 68
  • Slide 69
  • Beg. of class pulses (n=138)
      • Q1 = 63, Q3 = 78
      • IQR = 78 - 63 = 15
      • 1.5(IQR) = 1.5(15) = 22.5
      • Q1 - 1.5(IQR): 63 - 22.5 = 40.5
      • Q3 + 1.5(IQR): 78 + 22.5 = 100.5
      [Figure: boxplot marked at 40.5 (lower fence), 45, 63 (Q1), 70 (median), 78 (Q3), and 100.5 (upper fence)]
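
The fence arithmetic for the pulse data, and the outlier check for the value 7.9 a few slides earlier, can be reproduced with a small Python sketch. The helper name `fences` and the parameter `k` are assumptions made for illustration; only the quartiles from the slides are used, since the full n = 138 pulse data set is not listed.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Beginning-of-class pulses: Q1 = 63, Q3 = 78
low, high = fences(63, 78)
print(low, high)

# Earlier example: Q1 = 2.3, Q3 = 4.2, largest value 7.9
low2, high2 = fences(2.3, 4.2)
print(7.9 > high2)   # True means 7.9 lies beyond the upper fence
```

A value is a suspected outlier when it falls below the lower fence or above the upper fence; passing `k=3` would give the cutoffs for the far outliers mentioned in the construction steps.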
  • Slide 70
  • Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3? 1. 450 2. 750 3. 215 4. 545 [Figure: box plot of Pass Catching Yards by Receivers; axis ticks 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]
  • Slide 71
  • Rock concert deaths: histogram and boxplot
  • Slide 72
  • Automating Boxplot Construction
      • Excel out of the box does not draw boxplots.
      • Many add-ins are available on the internet that give Excel the capability to draw box plots.
      • Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.
  • Slide 73
  • Statcrunch Boxplot [Figure values: Q1 = first quartile = 2.3; Q3 = third quartile = 4.2; Largest = max = 7.9]
  • Slide 74
  • Tuition 4-yr Colleges
  • Slide 75
  • Statcrunch: 2012-13 NFL Salaries by Position
  • Slide 76
  • College Football Head Coach Salaries by Conference
  • Slide 77
  • 2013 Major League Baseball Salaries by Team
  • Slide 78
  • End of General Numerical Summaries. Next: Numerical Summaries of Symmetric Data