Chapter 2 Methods for Describing Sets of Data. Objectives Describe Data using Graphs Describe Data...

Chapter 2

Methods for Describing Sets of Data

Objectives

Describe Data using Graphs

Describe Data using Charts

Describing Qualitative Data

• Qualitative data are nonnumeric in nature

• Best described by using classes

• 2 descriptive measures:

class frequency – number of data points in a class

class relative frequency = class frequency / total number of data points in data set

class percentage = class relative frequency x 100
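The class measures above can be sketched in Python as follows. This is only an illustration: the class labels in `data` and the helper name `summarize_classes` are made up for the example and do not come from the slides.

```python
from collections import Counter

# Hypothetical qualitative observations (illustration only, not from the slides).
data = ["A", "B", "A", "C", "A", "B", "C", "A"]

def summarize_classes(observations):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(observations)
    counts = Counter(observations)  # class frequency for each class
    return {
        label: (freq, freq / n, 100 * freq / n)
        for label, freq in counts.items()
    }

summary = summarize_classes(data)
for label, (freq, rel, pct) in sorted(summary.items()):
    print(f"{label}: frequency={freq}, relative={rel:.3f}, percentage={pct:.1f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick check on any summary table.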

Describing Qualitative Data – Displaying Descriptive Measures

Summary Table

Lists each class with its class frequency and class percentage (class relative frequency x 100)

Describing Qualitative Data – Qualitative Data Displays

Bar Graph


Pie chart


Pareto Diagram

Graphical Methods for Describing Quantitative Data

The Data

Company  Percentage    Company  Percentage    Company  Percentage    Company  Percentage
   1        13.5          14        9.5          27        8.2          39        6.5
   2         8.4          15        8.1          28        6.9          40        7.5
   3        10.5          16       13.5          29        7.2          41        7.1
   4         9.0          17        9.9          30        8.2          42       13.2
   5         9.2          18        6.9          31        9.6          43        7.7
   6         9.7          19        7.5          32        7.2          44        5.9
   7         6.6          20       11.1          33        8.8          45        5.2
   8        10.6          21        8.2          34       11.3          46        5.6
   9        10.1          22        8.0          35        8.5          47       11.7
  10         7.1          23        7.7          36        9.4          48        6.0
  11         8.0          24        7.4          37       10.5          49        7.8
  12         7.9          25        6.5          38        6.9          50        6.5
  13         6.8          26        9.5

Percentage of Revenues Spent on Research and Development


For describing, summarizing, and detecting patterns in such data, we can use three graphical methods:

• dot plots

• stem-and-leaf displays

• histograms


Dot Plot


Stem-and-Leaf Display


Histogram


More on Histograms

Number of Observations in Data Set    Number of Classes

Less than 25                          5-6

25-50                                 7-14

More than 50                          15-20
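As a rough sketch of how a histogram's class frequencies could be computed, the Python below bins the 50 R&D percentages from the data table into equal-width classes. The choice of 10 classes (inside the 7-14 guideline for 25-50 observations), the function name `class_frequencies`, and the boundary convention (lower boundary inclusive, maximum placed in the last class) are assumptions made for this example, not something the slides specify.

```python
# R&D percentages for companies 1-50, transcribed from the data table.
rd_percentages = [
    13.5, 8.4, 10.5, 9.0, 9.2, 9.7, 6.6, 10.6, 10.1, 7.1, 8.0, 7.9, 6.8,
    9.5, 8.1, 13.5, 9.9, 6.9, 7.5, 11.1, 8.2, 8.0, 7.7, 7.4, 6.5, 9.5,
    8.2, 6.9, 7.2, 8.2, 9.6, 7.2, 8.8, 11.3, 8.5, 9.4, 10.5, 6.9,
    6.5, 7.5, 7.1, 13.2, 7.7, 5.9, 5.2, 5.6, 11.7, 6.0, 7.8, 6.5,
]

def class_frequencies(values, num_classes):
    """Count observations in equal-width classes spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes
    counts = [0] * num_classes
    for v in values:
        # Lower boundary inclusive; the maximum value goes in the last class.
        index = min(int((v - low) / width), num_classes - 1)
        counts[index] += 1
    return low, width, counts

# 50 observations: the guideline table suggests 7-14 classes; 10 is assumed here.
low, width, counts = class_frequencies(rd_percentages, 10)
for k, count in enumerate(counts):
    start = low + k * width
    print(f"{start:5.2f} to {start + width:5.2f}: {'*' * count}")
```

Each row of asterisks is a sideways histogram bar; changing the number of classes shows how the apparent shape of the distribution depends on that choice.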

Summation Notation

Used to simplify summation instructions

Each observation in a data set is identified by a subscript

$x_1, x_2, x_3, x_4, x_5, \ldots, x_n$

Notation used to sum the above numbers together is

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + \cdots + x_n$

Summation Notation

Data set of 1, 2, 3, 4

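Are these the same? $\sum_{i=1}^{4} x_i^2$ and $\left(\sum_{i=1}^{4} x_i\right)^2$

$\sum_{i=1}^{4} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 1 + 4 + 9 + 16 = 30$

$\left(\sum_{i=1}^{4} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4)^2 = (1 + 2 + 3 + 4)^2 = 10^2 = 100$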
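The difference between the sum of the squares and the square of the sum, for the data set 1, 2, 3, 4, can be checked with a short Python sketch. The variable names are chosen only for this illustration.

```python
x = [1, 2, 3, 4]

# Sum of squares: square each observation first, then add.
sum_of_squares = sum(xi ** 2 for xi in x)

# Square of the sum: add the observations first, then square the total.
square_of_sum = sum(x) ** 2

print(sum_of_squares)  # 30
print(square_of_sum)   # 100
```

The two results differ, so the order of squaring and summing matters when reading summation notation.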

Numerical Measures of Central Tendency

• Central Tendency – tendency of data to center about certain numerical values

• 3 commonly used measures of Central Tendency:

Mean

Median

Mode


The Mean

• Arithmetic average of the elements of the data set

• Sample mean denoted by $\bar{x}$

• Population mean denoted by $\mu$

• Calculated as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$


The Median

• Middle number when observations are arranged in order

• Median denoted by m

• Identified as the $\left(\frac{n}{2} + 0.5\right)$th observation if n is odd, and the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations if n is even


The Mode

• The most frequently occurring value in the data set

• Data set can be multi-modal – have more than one mode

• Data displayed in a histogram will have a modal class – the class with the largest frequency


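The Data set 1 3 5 6 8 8 9 11 12

Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 3 + 5 + 6 + 8 + 8 + 9 + 11 + 12}{9} = \frac{63}{9} = 7$

Median is the $\left(\frac{n}{2} + 0.5\right)$th, or 5th, observation: 8

Mode is 8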
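The three measures of central tendency for the data set 1 3 5 6 8 8 9 11 12 can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic; `multimode` (Python 3.8+) is used because it returns every mode, which suits multi-modal data sets.

```python
import statistics

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]

mean = statistics.mean(data)         # sum of observations divided by n
median = statistics.median(data)     # middle value of the sorted observations
modes = statistics.multimode(data)   # list of all most frequently occurring values

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```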

Numerical Measures of Variability

• Variability – the spread of the data across possible values

• 3 commonly used measures of Variability:

Range

Variance

Standard Deviation


The Range

• Largest measurement minus the smallest measurement

• Loses sensitivity when data sets are large

These 2 distributions have the same range.

How much does the range tell you about the data variability?


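The Sample Variance ($s^2$)

• The sum of the squared deviations from the mean, divided by (n-1). Expressed in squared units of measurement

• Why square the deviations? The sum of the deviations from the mean is always zero, so the unsquared deviations cannot measure spread

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$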


The Sample Standard Deviation (s)

• The positive square root of the sample variance

• Expressed in the original units of measurement

$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
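To make the measures of variability concrete, here is a minimal Python sketch that computes the range, sample variance, and sample standard deviation step by step. It reuses the data set 1 3 5 6 8 8 9 11 12 from the central tendency example; reusing that data set here is a choice made for illustration.

```python
import math
import statistics

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]
n = len(data)
mean = sum(data) / n

# Range: largest measurement minus the smallest measurement.
data_range = max(data) - min(data)

# Deviations from the mean add up to zero, which is why they are squared.
deviations = [x - mean for x in data]

# Sample variance: sum of squared deviations divided by (n - 1).
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: positive square root of the sample variance.
sample_std = math.sqrt(sample_variance)

print("range:", data_range)
print("sum of deviations:", sum(deviations))
print("sample variance:", sample_variance)
print("sample standard deviation:", round(sample_std, 4))

# Cross-check against the standard library, which also divides by (n - 1).
print(statistics.variance(data), statistics.stdev(data))
```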


Samples and Populations - Notation

                      Sample    Population

Variance              $s^2$     $\sigma^2$

Standard Deviation    $s$       $\sigma$

Numerical Measures of Relative Standing

Descriptive measures of relationship of a measurement to the rest of the data

Common measures:

• percentile ranking

• z-score


Percentile rankings make use of the pth percentile

The median is an example of a percentile: it is the 50th percentile – 50% of observations lie above it, and 50% lie below it

For any p, the pth percentile has p% of the measurements lying below it, and (100-p)% above it


z-score – the distance between a measurement x and the mean, expressed in standard units

Use of standard units allows comparison across data sets

Sample: $z = \frac{x - \bar{x}}{s}$

Population: $z = \frac{x - \mu}{\sigma}$
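A small Python sketch of the sample z-score follows. The function name `z_score` and the reuse of the data set 1 3 5 6 8 8 9 11 12 are assumptions made for illustration.

```python
import statistics

data = [1, 3, 5, 6, 8, 8, 9, 11, 12]
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

def z_score(x, mean, std):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / std

z_scores = [z_score(x, mean, s) for x in data]
for x, z in zip(data, z_scores):
    print(f"x={x:2d}  z={z:+.2f}")
```

A negative z-score means the measurement lies below the mean and a positive one means it lies above. Because the units cancel, z-scores from different data sets can be compared directly.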


More on z-scores

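z-scores follow the empirical rule for mounded distributions: approximately 68% of measurements have a z-score between -1 and 1, approximately 95% between -2 and 2, and approximately 99.7% between -3 and 3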

Methods for Detecting Outliers

Outlier – an observation that is unusually large or small relative to the data values being described

Causes:

• Invalid measurement

• Misclassified measurement

• A rare (chance) event

2 detection methods:

• Box Plots

• z-scores


Box Plots

• based on quartiles, values that divide the dataset into 4 groups

• Lower Quartile QL – 25th percentile

• Middle Quartile - median

• Upper Quartile QU – 75th percentile

• Interquartile Range (IQR) = QU - QL


Box Plots

Not on plot – inner and outer fences, which determine potential outliers

[Figure: box plot labeled with QU (hinge), QL (hinge), Median, Whiskers, and a Potential Outlier]


Rules of thumb

• Box Plots

– measurements between inner and outer fences are suspect

– measurements beyond outer fences are highly suspect

• z-scores

– z-scores beyond 3 in absolute value in mounded distributions (2 in highly skewed distributions) are considered outliers
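The two rules of thumb can be sketched in Python as below. Several details are assumptions, since the slides do not state them: the inner fences are placed 1.5 x IQR beyond the hinges and the outer fences 3 x IQR beyond them (a common textbook convention), the quartiles come from `statistics.quantiles` with its default method (other quartile conventions give slightly different values), and the sample appends a made-up extreme value of 40 to the earlier data set.

```python
import statistics

def box_plot_fences(values):
    """Return (inner, outer) fences as (low, high) pairs.

    Assumes inner fences at 1.5 * IQR and outer fences at 3 * IQR
    beyond the lower and upper quartiles.
    """
    q_l, _median, q_u = statistics.quantiles(values, n=4)
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
    outer = (q_l - 3.0 * iqr, q_u + 3.0 * iqr)
    return inner, outer

def classify(x, inner, outer):
    """Apply the box plot rule of thumb to one measurement."""
    if x < outer[0] or x > outer[1]:
        return "highly suspect"
    if x < inner[0] or x > inner[1]:
        return "suspect"
    return "not flagged"

def z_score_outliers(values, threshold=3.0):
    """Return measurements whose |z| exceeds the threshold."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / s) > threshold]

# Earlier data set plus one hypothetical extreme value (40) for illustration.
sample = [1, 3, 5, 6, 8, 8, 9, 11, 12, 40]
inner, outer = box_plot_fences(sample)
for x in sample:
    print(x, classify(x, inner, outer))
print(z_score_outliers(sample, threshold=2.0))
```

The `threshold` parameter lets the same function apply either the cutoff of 3 for mounded distributions or the cutoff of 2 for highly skewed ones.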
