STATISTICS

PNP COMPTROLLERSHIP COURSE

Statistics

• The term has two meanings.
• Statistics (singular) is the science of collecting, organizing, analyzing, and interpreting information.
• Statistics (plural) are numbers calculated from a set or collection of information.

General Categories

• Descriptive Statistics comprises those methods used to organize and describe information that has been collected.
• Inferential Statistics involves the theory of probability and comprises those methods and techniques for making generalizations, predictions, or estimates about the population by using limited information.

Descriptive Statistics

Organizing Data

• Data are the building blocks of statistics.
• They are generally categorized as quantitative or qualitative.
• They are also classified according to the type of measurement scale used, such as:
– Nominal scale
– Ordinal scale
– Interval scale
– Ratio scale

Nominal Scale

• A nominal scale exists for both quantitative and qualitative data.
– A nominal scale for quantitative data assigns numbers to categories to distinguish one from another, such as basketball jerseys, postal zip codes, and telephone numbers.
– A nominal scale for qualitative data is an unordered grouping of data into discrete categories where each datum can go into only one group, such as sex, blood type, or religion.

Ordinal Scale

• Data measured on a nominal scale whose categories are ordered in some fashion are referred to as ordinal data.
– Letter grades such as A, B, C, D, and F
– Ranks such as Inspector, Senior Inspector, Chief Inspector
– Residence number
– Performance ratings such as Poor, Fair, Good
– Grades in school such as 1, 2, 3, and so on

Interval Scale

• Data measured on an ordinal scale for which distances between values can be calculated are called interval data.
• The distance between two values is relevant.
• Interval data are necessarily quantitative.
• An interval scale does not necessarily have a zero point, that is, a point which indicates the absence of what we are measuring.

Example

• IQ test scores. We can say an IQ score of 180 is higher than an IQ score of 90. We can also say that it is 90 points higher. But we cannot say that a person with an IQ score of 180 is twice as smart as a person with an IQ score of 90. Likewise, a given difference between two IQ scores does not always have the same meaning. The differences 100 - 90 and 150 - 140 may have different interpretations even though both equal 10.

Another Example

• Celsius temperature. A temperature of 80 °C is 40 degrees warmer than a temperature of 40 °C. But it is not correct to say that 80 °C is twice as warm as 40 °C. Note that 0 °C does not represent the absence of heat. The absence of heat is represented by 0 kelvin, which is equivalent to about -273 °C.

Ratio Scale

• Data measured on an interval scale with a zero point meaning "none" are called ratio data. Because the zero point of the Celsius scale does not represent the absence of heat, the Celsius scale is not a ratio scale. The Kelvin scale is a ratio scale. Examples of other ratio scales are those commonly used to measure units such as feet, meters, pounds, and pesos. The results of counting objects are also ratio data.

Organizing Data Using Tables

• The objective of organizing data is to arrange a set of data into a useful form in order to reveal essential features and simplify certain analyses.
• Data that are not organized in some fashion are called raw data.
• One method of arranging data is to construct an ordered array, that is, to arrange the data from low to high (or high to low).
• If the number of data is large, the data may be difficult to manage, so tables are often used as a general approach to organizing raw data.

Ungrouped Frequency Tables

• The frequency of a measurement or category is the total number of times the measurement or category occurs in a collection of data. The symbol f is used to denote the frequency of a measurement.
• For example: a sample of data representing the number of free throws missed by a basketball team during the last 7 games:

7  2  8  4  2  7  2

Frequency Table of Free Throw Data

Data x   f
2        3
4        1
7        2
8        1
Total    7
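The table above can be reproduced with a short Python sketch. This is only an illustration of the counting step; the variable names `missed` and `freq` are chosen here and are not part of the course material.

```python
from collections import Counter

# Free throws missed in the last 7 games (data from the example above).
missed = [7, 2, 8, 4, 2, 7, 2]

# Counter maps each measurement x to its frequency f.
freq = Counter(missed)

# Print the ungrouped frequency table in ascending order of x.
for x in sorted(freq):
    print(x, freq[x])
print("Total", sum(freq.values()))
```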

Tally Marks

• For a very large number of data, an intermediate step is to count observations through the use of tally marks to aid in determining the frequency f for each observation.
• Corresponding to each observation we place a tally mark in a tally column.
• After all tallies are placed, they are counted for each measurement x to determine the frequency.

Example

9  8  7  8  4  3
2  1  0  5  3  2
1  1  7  3  2  8
7  6  6  4  3  2
2  0  9  4  6  9
6  9  4  3  5  7
3  2  1  4  4  2

x       Frequency f
0       2
1       4
2       7
3       6
4       6
5       2
6       4
7       4
8       3
9       4
Total   42

Grouped Frequency Tables

• A grouped frequency table shows frequencies according to groups or classes of measurements.
• For example, a memorial hospital wants to study whether its emergency room staffing is adequate. To start the study, the manager tracks the number of people visiting the emergency room each day for a 12-day period, with the following result:

7  43  8  22  13  28  36  18  23  21  15  52

Steps

• The manager constructs six groupings or classes: the first class representing 1-10 patients; the second class, 11-20 patients; the 3rd class, 21-30 patients; the 4th class, 31-40; the 5th class, 41-50; and the 6th class, 51-60.
• For the first class, the lower class limit is 1 while the upper class limit is 10. The rest of the classes follow a similar pattern of lower and upper limits.
• Tally the number of patients that fall within each class.
• Construct the grouped frequency table.

Grouped Frequency Table for Emergency Room Data

Class    Frequency f
1-10     2
11-20    3
21-30    4
31-40    1
41-50    1
51-60    1
Total    12
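The steps above can be sketched in Python as follows. This is a minimal illustration that assumes the six classes chosen by the manager; the names `visits`, `classes`, and `grouped` are introduced here for the example only.

```python
# Daily emergency room visits over the 12-day period (from the example above).
visits = [7, 43, 8, 22, 13, 28, 36, 18, 23, 21, 15, 52]

# Class limits (lower, upper) for the six classes: 1-10, 11-20, ..., 51-60.
classes = [(lower, lower + 9) for lower in range(1, 61, 10)]

# Count the observations that fall within each pair of class limits.
grouped = {
    (lo, hi): sum(lo <= v <= hi for v in visits)
    for lo, hi in classes
}

for (lo, hi), f in grouped.items():
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(grouped.values()))
```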

Basic Guidelines

• Each class should have the same width.
• No two classes should overlap.
• Each piece of data should belong to a class.


Class Boundaries and Class Widths

• Class boundaries determine class widths.
• Class boundaries for a grouped frequency table are determined by considering the unit or precision of measurement.
• The lower class boundary of a class interval is located one-half unit below the lower class limit. The upper class boundary is one-half unit above the upper class limit.
• The class width w for any class interval is found by subtracting the lower class boundary from the upper class boundary, thus:

$w = l_2 - l_1$

where $l_1$ is the lower class boundary and $l_2$ is the upper class boundary of each class interval.
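As a small illustration of the boundary rule, the sketch below computes the boundaries and width of a class from its limits. The function names and the `unit` parameter (the precision of measurement, assumed to be 1 for whole-number counts) are choices made for this example.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (l1, l2): half a unit below the lower limit, half a unit above the upper."""
    half = unit / 2
    return lower_limit - half, upper_limit + half


def class_width(lower_limit, upper_limit, unit=1):
    """Class width w = l2 - l1, computed from the class boundaries."""
    l1, l2 = class_boundaries(lower_limit, upper_limit, unit)
    return l2 - l1


print(class_boundaries(1, 10))  # (0.5, 10.5)
print(class_width(1, 10))       # 10.0
```

For the emergency room classes, every class has boundaries half a patient outside its limits, so each class has a width of 10.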

Basic Rules in Constructing a Grouped Frequency Table

• How many classes should be used?
• What should be the width of each class?
• At what value should the first class start?
• How is the class mark or midpoint computed?

Basic Rules in Constructing a Grouped Frequency Table

• For the number of classes, Sturges' rule:

$c = 3.3 \log n + 1$

• For width, the rule is $w = R / c$, where R is the range computed by subtracting the smallest measurement L from the largest measurement U; thus, $R = U - L$.
• The lower limit of the first class should be near, and at most as large as, the smallest measurement L.
• The class mark X, or midpoint, is computed by adding the lower class limit a and the upper class limit b and dividing the sum by 2; thus, $X = (a + b) / 2$.
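The rules above can be sketched as three small functions. This is an illustration only: the function names are invented here, the logarithm is assumed to be base 10 (the usual reading of Sturges' rule in this form), and in practice c and w are rounded to convenient whole numbers.

```python
import math


def sturges_classes(n):
    """Suggested number of classes: c = 3.3 log10(n) + 1."""
    return 3.3 * math.log10(n) + 1


def class_width_estimate(data, c):
    """Width w = R / c, where the range is R = U - L."""
    return (max(data) - min(data)) / c


def class_mark(a, b):
    """Midpoint X of a class with lower limit a and upper limit b."""
    return (a + b) / 2


# Emergency room data from the earlier example.
visits = [7, 43, 8, 22, 13, 28, 36, 18, 23, 21, 15, 52]
c = sturges_classes(len(visits))
w = class_width_estimate(visits, c)
print(round(c, 2), round(w, 2))
print(class_mark(1, 10))  # 5.5
```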

Relative Frequency Table

• It is sometimes useful to express each value or class in a frequency table as a fraction of the total observations.
• The relative frequency of a class is found by dividing the frequency f by the total number of observations n.
• The table that describes the relative frequencies is then called a relative frequency table.

Cumulative Frequency Table

• There are many occasions when we are interested in the number of observations less than or equal to some value. Example: a teacher may want to know the number of students who scored 70% or less on an examination. The cumulative frequency answers that question.
• The cumulative frequency for any measurement or class is the total of the frequency for that measurement or class and the frequencies of all measurements or classes of smaller value.

Cumulative Relative Frequency Table

• Cumulative frequency tables can also be constructed for tables containing relative frequencies or percentages.
• The procedures are identical to those used for cumulative frequency tables except that relative frequencies or percentages are used.
• Cumulative relative frequencies have many uses. One is in scoring standardized tests through the percentiles method. A percentile score tells what part of the tested population scored lower. For example, if 50 is said to be the 90th percentile in an examination, it means that 90% of the scores were lower than 50.

Example

• A final examination result has the following data.

17  15  78  21  10  32   7  65  18  87
 4  22  34  42   9   9  82  79  98   4
44  64  62  77   2  81  45  37  83  44
77  13  41  16  17  13  82  37   5  54
 7  67  88  41  61  22  92  16  67  85

• In constructing the frequency table, assume c = 5. With U = 98 and L = 2, the range is R = 96 and w = 96 / 5 = 19.2, which is rounded up to a class width of 20.

Grouped Frequency Table

Class Number   Class     X      f
1              2-21      11.5   18
2              22-41     31.5   8
3              42-61     51.5   6
4              62-81     71.5   10
5              82-101    91.5   8

Relative Frequency Table

Class     f     Relative Frequency
2-21      18    .36
22-41     8     .16
42-61     6     .12
62-81     10    .20
82-101    8     .16
Total     50

Cumulative Frequency Table

Class Number   Class     Cumulative Frequency
1              2-21      18
2              22-41     26
3              42-61     32
4              62-81     42
5              82-101    50

Cumulative Relative Frequency Table

Class     f     Relative Frequency   Cumulative Relative Frequency
2-21      18    .36                  .36
22-41     8     .16                  .52
42-61     6     .12                  .64
62-81     10    .20                  .84
82-101    8     .16                  1.00
Total     50
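The relative, cumulative, and cumulative relative frequencies in the tables above all follow from the class frequencies. The sketch below shows one way to derive them in Python; the variable names are illustrative, and the cumulative relative frequency is taken as cumulative frequency divided by n.

```python
from itertools import accumulate

# Classes and frequencies from the grouped frequency table of examination scores.
classes = ["2-21", "22-41", "42-61", "62-81", "82-101"]
f = [18, 8, 6, 10, 8]
n = sum(f)

rel = [fi / n for fi in f]          # relative frequency f / n
cum = list(accumulate(f))           # running total of the frequencies
cum_rel = [ci / n for ci in cum]    # cumulative frequency / n

for row in zip(classes, f, rel, cum, cum_rel):
    print("{:<8}{:>4}{:>7.2f}{:>5}{:>7.2f}".format(*row))
```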

Graphical Representation of Data

[Figure: A Bar Graph. Quarterly values (1st Qtr to 4th Qtr) for the East, West, and North series, on a vertical axis from 0 to 90.]

[Figure: A Pie Graph. Shares for the 1st Qtr to 4th Qtr.]

[Figure: A Line Graph. Quarterly values (1st Qtr to 4th Qtr) for the East, West, and North series, on a vertical axis from 0 to 100.]

Measures of Central Tendency

• The first characteristic of a set of data that we want to measure is the center or central tendency. The purpose is to summarize a collection of data with a single value that serves as a representative for the rest of the data.
• Common measures of central tendency:
– Mean
– Median
– Mode
– Midrange

Mean

• The mean or arithmetic average is found by adding the numbers and then dividing the sum by the number of observations n:

$\bar{x} = \sum x / n$

• A population mean is denoted by:

$\mu = \sum x / N$

• The mean for grouped data:

$\bar{x} = \sum (f X) / \sum f$
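The two sample formulas can be checked with a short Python sketch. The function names are chosen for this illustration; the data are the free throw sample and the class marks and frequencies from the examination example.

```python
def mean(values):
    """Arithmetic mean: sum of x divided by the number of observations n."""
    return sum(values) / len(values)


def grouped_mean(class_marks, freqs):
    """Mean for grouped data: sum of f * X divided by sum of f."""
    return sum(f * x for x, f in zip(class_marks, freqs)) / sum(freqs)


missed = [7, 2, 8, 4, 2, 7, 2]
print(round(mean(missed), 2))  # 4.57

marks = [11.5, 31.5, 51.5, 71.5, 91.5]
freqs = [18, 8, 6, 10, 8]
print(grouped_mean(marks, freqs))  # 44.3
```

The grouped mean is an approximation, since every score in a class is treated as if it were equal to the class mark.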

Disadvantage of the Mean

• The mean as a measure of center has a disadvantage. It is affected by extreme measurements on one end of a distribution. Because it depends on the value of every measurement, extreme values can lead to the mean misrepresenting the data.
• In this case, the median may provide a better measure than the mean, since it is not affected by the extreme values.

Median

• In general, the median is found by first ranking the data.
• If there is an odd number of observations, then the median is the number in the middle of the distribution.
• If the number of observations is even, then the median is computed by adding the two numbers found in the middle positions and dividing the sum by 2.
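The ranking procedure above can be written as a short function. This is a sketch with a function name chosen here; it assumes a non-empty list of numbers.

```python
def median(values):
    """Rank the data, then take the middle value or the average of the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                       # odd n: the middle observation
    return (ranked[mid - 1] + ranked[mid]) / 2   # even n: average of the middle pair


print(median([7, 2, 8, 4, 2, 7, 2]))     # 4
print(median([1, 2, 3, 3, 3, 4, 5, 6]))  # 3.0
```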

Mode

• The mode, if it exists, is the most frequent measurement or observation.
• The mode has the advantage of being easily found, especially in small samples, and is usually not influenced by extreme measurements on one end of an ordered set of data.
• Example: In an array of data arranged as follows: 1, 2, 3, 3, 3, 4, 5, and 6, the mode is 3.
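A sketch of finding the mode in Python follows. The convention used here, that no mode exists when every value occurs exactly once and that ties return all of the most frequent values, is an assumption of this example and is not stated in the course material. The input is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: every value occurs once, so no mode exists
    return sorted(x for x, f in counts.items() if f == top)


print(modes([1, 2, 3, 3, 3, 4, 5, 6]))  # [3]
```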

Relationships Between Mean, Median and Mode

[Figure: three distribution curves showing the positions of the mean, median, and mode.]

• Symmetry: Mean = Median = Mode
• Rightward skewness: Mean > Median > Mode
• Leftward skewness: Mean < Median < Mode

Midrange

• The midrange of a set of data is the average of the largest and smallest measurements, thus:

$\text{Midrange} = (U + L) / 2$

• For data organized in a grouped frequency table, the midrange is approximately the average of the lower class boundary of the first class and the upper class boundary of the last class, thus:

$\text{Midrange} = (l_{1fc} + l_{2lc}) / 2$

where $l_{1fc}$ is the lower class boundary of the first class and $l_{2lc}$ is the upper class boundary of the last class.
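Both midrange formulas are illustrated below. The function names and the `unit` parameter (the precision of measurement, assumed to be 1) are introduced for this sketch only.

```python
def midrange(values):
    """Average of the largest measurement U and the smallest measurement L."""
    return (max(values) + min(values)) / 2


def grouped_midrange(first_lower_limit, last_upper_limit, unit=1):
    """Average of the first class's lower boundary and the last class's upper boundary."""
    lower_boundary = first_lower_limit - unit / 2
    upper_boundary = last_upper_limit + unit / 2
    return (lower_boundary + upper_boundary) / 2


print(midrange([7, 2, 8, 4, 2, 7, 2]))  # 5.0
print(grouped_midrange(2, 101))         # 51.5
```

The second call uses the classes 2-21 through 82-101 from the examination example.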

Measures of Dispersion or Variability

• Quite often, measures of central tendency alone do not adequately describe a characteristic being observed.
• Hence, variability is an important concept in statistics. As a result, there are many measures of variability for a collection of quantitative data, such as:
– Range
– Variance
– Standard deviation
– Standard score

Range

• As previously defined, range is the difference between the largest and the smallest measurements; thus:

$R = U - L$

where R is the range, L is the smallest measurement, and U is the largest measurement.

Deviation Score

• Deviation score is the quantity defined by this relationship:

$x - \bar{x}$

• Deviation score represents the directed distance a measurement has from the mean of a set of data.
• A positive deviation score means the measurement is above the mean; a negative deviation score means the measurement is below the mean; a zero deviation score means the measurement is equal to the mean.

Sum of Squares

• Adding the deviation scores always gives zero, a useless result for analyzing a set of data. To avoid this situation the sum of squares is used.
• Sum of squares SS is computed by first squaring the deviation scores, then adding them up; thus:

$SS = \sum (x - \bar{x})^2$

Variance

• The variance of a population of measurements is defined as the average of the squared deviation scores, denoted by $\sigma^2$; thus:

$\sigma^2 = SS / N$

• The variance of a sample, denoted by $s^2$, is defined by the following formula:

$s^2 = SS / (n - 1)$

• The variance for data in frequency tables is computed by first deriving the sum of squares using the following formula, then proceeding as above:

$SS = \sum (f x^2) - (\sum f x)^2 / \sum f$

where x is the class mark (midpoint) of the class.

Variance

• Thus, the variance for grouped data of a population is:

$\sigma^2 = SS / \sum f$

• The variance for grouped data of a sample, denoted by $s^2$, is defined by the following formula:

$s^2 = SS / (\sum f - 1)$
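The variance formulas can be checked against Python's standard `statistics` module. The sketch below is illustrative: the function names are chosen here, and the last function applies the frequency-table shortcut for SS to the ungrouped free throw table, where it should agree with the direct sum of squares.

```python
import statistics


def sum_of_squares(values):
    """SS: sum of squared deviation scores from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def population_variance(values):
    """SS / N."""
    return sum_of_squares(values) / len(values)


def sample_variance(values):
    """SS / (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)


def grouped_sum_of_squares(xs, freqs):
    """SS = sum(f x^2) - (sum(f x))^2 / sum(f), for values or class marks xs."""
    sfx = sum(f * x for x, f in zip(xs, freqs))
    sfx2 = sum(f * x * x for x, f in zip(xs, freqs))
    return sfx2 - sfx ** 2 / sum(freqs)


missed = [7, 2, 8, 4, 2, 7, 2]
print(population_variance(missed), statistics.pvariance(missed))
print(sample_variance(missed), statistics.variance(missed))
print(sum_of_squares(missed), grouped_sum_of_squares([2, 4, 7, 8], [3, 1, 2, 1]))
```

Each printed pair should agree up to floating-point rounding.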

Standard Deviation

• The standard deviation is defined as the positive square root of the variance.
• The standard deviation of a population is denoted by $\sigma$. The standard deviation of a sample is denoted by s; thus:

$\sigma = \sqrt{\sigma^2}$

$s = \sqrt{s^2}$

• If the standard deviation of the population is given, the standard deviation of the mean of a sample of size n (the standard error of the mean) is derived from the following formula:

$\sigma_{\bar{x}} = \sigma / \sqrt{n}$

Standard Score

• A measure that takes into account the dispersion of the scores is called a standard score.
• Standard scores also allow analysts to compare measurements from different distributions, and so to decide on a ranking.
• A standard score, denoted as z, is defined as:

Standard score = Deviation Score / Standard Deviation

$z = (x - \mu) / \sigma$

Example Problem

• Consider this problem. Pedro scores 700 on the math portion of a national test A. Pablo scores 24 on another national test B. The mean and the standard deviation of test A are 500 and 100 respectively, while those of test B are 18 and 6. If both tests are regarded as measures of the same kind of ability, which person did better?
• To answer the question we need a method of comparing scores from different distributions.

Answer

• Using deviation scores, where Pedro has 200 (700 - 500) and Pablo has 6 (24 - 18), we cannot compare the two because the spread of the scores is not taken into account.
• Using z, we can now compare:

For Pedro, z = (700 - 500) / 100 = 2
For Pablo, z = (24 - 18) / 6 = 1

• Pedro's score is 2 standard deviations above the mean of his test while Pablo's is only 1, so Pedro did better than Pablo.
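The comparison above amounts to one small function. The name `z_score` is chosen for this sketch; the inputs are the figures given in the problem.

```python
def z_score(x, mean, sd):
    """Standard score: deviation score divided by the standard deviation."""
    return (x - mean) / sd


pedro = z_score(700, 500, 100)
pablo = z_score(24, 18, 6)
print(pedro, pablo)  # 2.0 1.0
print("Pedro" if pedro > pablo else "Pablo")  # Pedro
```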

Inferential Statistics

Knowledge Requirements

– The Concept of Probability
– The Theorem on Counting
– Binomial Distributions
– Normal Distributions
– Sampling Theory
– Analysis of Variance (ANOVA)
– Linear Regression Analysis

The Concept of Probability

• Probability provides the foundation of inferential statistics.
• Using probability theory, we can deduce the likelihood of certain samples occurring with specified properties, which will enable us to draw inferences about the population.
• The probability of an event is a number between 0 and 1.
• If E is an event, then P(E) denotes the probability of E.

The Concept of Probability

• Probability satisfies the following properties:
– $P(E_i) \ge 0$
– $P(E_i) \le 1$
– $\sum P(E_i) = 1$, where the sum runs over all the mutually exclusive outcomes $E_i$ of the experiment
• Probabilities can be assigned to events through experiments or empirical observation. A probability obtained through observation is called an empirical probability, and it is computed using the formula:

$P(E) = f / n$

where f is the number of times the event was observed and n is the total number of observations.

Example Problem 1

• An insurance company wants to estimate the probability that a police car will be involved in a car accident. Last month, 7 of the 20 police cars it insured were involved in accidents. What is the estimated probability? What is the chance that a police car will not be involved in an accident?

• Answers:
a. P(E) = f / n = 7/20 = .35
b. 1 – P(E) = 1 – .35 = .65
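The two answers can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `empirical_probability` is introduced here for illustration and does not come from the course material.

```python
def empirical_probability(f, n):
    """Return the empirical probability f / n of an event seen f times in n observations."""
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("need n > 0 and 0 <= f <= n")
    return f / n


# Example Problem 1: 7 of the 20 insured police cars were involved in accidents.
p_accident = empirical_probability(7, 20)
p_no_accident = 1 - p_accident  # complement of the event

print(round(p_accident, 2), round(p_no_accident, 2))  # 0.35 0.65
```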

Example Problem 2

Entrance test math scores for a large university are grouped as shown:

Scores      f
200-299     3,600
300-399     11,900
400-499     12,000
500-599     5,500
600-699     1,500
700-799     500

• If a student is selected at random, what is the probability that the student's math score:
– Exceeds 399?
– Is at most 599?
– Is between 600 and 699?
– Is not between 400 and 499?
– Is less than or equal to 699?

• Hint: Construct the relative frequency table and the answers will be found in it.

Requirements

Scores      f         Relative f
200-299     3,600     .103
300-399     11,900    .340
400-499     12,000    .343
500-599     5,500     .157
600-699     1,500     .043
700-799     500       .014
Total       35,000
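One way to read the five answers off the relative frequency table is sketched below. The dictionary layout and the helper `probability` are assumptions made for this illustration; only the class intervals and frequencies come from the problem.

```python
# Frequencies from the grouped score table (class interval -> number of students).
frequencies = {
    (200, 299): 3600,
    (300, 399): 11900,
    (400, 499): 12000,
    (500, 599): 5500,
    (600, 699): 1500,
    (700, 799): 500,
}

total = sum(frequencies.values())  # 35,000 students in all
relative = {interval: f / total for interval, f in frequencies.items()}


def probability(condition):
    """Add up the relative frequencies of the class intervals that satisfy `condition`."""
    return sum(rf for (low, high), rf in relative.items() if condition(low, high))


answers = {
    "exceeds 399": probability(lambda low, high: low > 399),
    "at most 599": probability(lambda low, high: high <= 599),
    "between 600 and 699": probability(lambda low, high: (low, high) == (600, 699)),
    "not between 400 and 499": probability(lambda low, high: (low, high) != (400, 499)),
    "at most 699": probability(lambda low, high: high <= 699),
}

for question, p in answers.items():
    print(f"{question}: {p:.3f}")
```

Read this way, the five answers come out to roughly .557, .943, .043, .657, and .986.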

The Theorem on Counting

• The fundamental theorem on counting (FTC) states that:

If an event can occur in any one of m ways and, after it has occurred, a second event can occur in any one of n ways, then the two events can occur together, in the order stated, in mn different ways.

• The FTC is best illustrated with the aid of a decision tree that lays out the decisions to be made.


[Figure: decision tree with three roads from A to B and, from each B, two roads to C]

There are two decisions to be made:
1. At A, which road to take to B?
2. At B, which road to take to C?

Question: How many ways are there to go from A to C?

Hint: At A, there are 3 ways. At B, there are 2 ways. Use the FTC.
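The count suggested by the hint can be checked by listing every route. In this small sketch the road labels are hypothetical; only the counts (3 roads, then 2 roads) come from the problem.

```python
from itertools import product

roads_a_to_b = ["AB1", "AB2", "AB3"]  # 3 ways to go from A to B (hypothetical labels)
roads_b_to_c = ["BC1", "BC2"]         # 2 ways to go from B to C (hypothetical labels)

# Every route is one choice at A followed by one choice at B.
routes = list(product(roads_a_to_b, roads_b_to_c))

print(len(routes))  # 6, which equals 3 * 2
```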

Binomial Distributions

• The binomial distribution describes the probability distribution of the number of successes, and hence failures, in a fixed number of trials.

• In a binomial experiment, the following properties hold:
– The experiment consists of n identical trials.
– Each trial results in exactly one of two outcomes, called success or failure. Success is denoted by S and failure by F.
– The n trials are independent.
– The probability of success, p, remains constant from trial to trial.

• A table for the binomial distribution can be used to find values of P(x), the probability of x successes, for selected values of p and n.


• If a baseball player with a batting average of .600 comes to bat five times in a game, what is the probability that he will get exactly three hits?

• Given:
– p = .600
– n = 5
– x = 3

• Use the binomial table.
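When no table is at hand, the same value can be computed directly from the binomial formula P(x) = C(n, x) p^x (1 – p)^(n – x). The sketch below assumes that formula in place of the table lookup; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Batting example: p = .600, n = 5 at-bats, x = 3 hits.
prob_three_hits = binomial_pmf(3, 5, 0.600)

print(round(prob_three_hits, 4))  # 0.3456
```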

Normal Distribution

A normal distribution is mound shaped, taking on a bell-shaped appearance as shown.

[Figure: bell-shaped curve of y against x, centered at μ]

y = e^(–(x – μ)² / (2σ²)) / (σ√(2π))

where:
μ = mean of the population
σ = standard deviation of the population
e ≈ 2.718
x = any real number

Properties

• A normal distribution is mound or bell shaped.
• The area under the curve is always equal to 1.
• The mean is located at the center of the distribution and the curve is symmetrical about its mean.
• The mean, median and mode are equal.
• The curve extends indefinitely to the left and right of the mean and approaches the horizontal axis.
• The curve never touches the horizontal axis.
• The shape and position of the curve depend on two parameters: the mean μ and the standard deviation σ.

Empirical Rule

• Approximately 68% of the measurements fall within 1 standard deviation (σ) of the mean, that is, within μ ± σ.

• Approximately 95% of the measurements fall within 2 standard deviations of the mean, that is, within μ ± 2σ.

• Approximately 99.7% of the measurements fall within 3 standard deviations of the mean, that is, within μ ± 3σ.
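The three percentages can be checked numerically. The sketch below relies on the standard identity that, for a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2); the function name `within_k_sd` is introduced here only for illustration.

```python
from math import erf, sqrt


def within_k_sd(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```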


Problem: For a normal distribution with mean μ = 50 and standard deviation σ = 5, find P(40 < x < 65).

[Figure: normal curve with 40, 50, and 65 marked on the x axis]

Solution

• Transform the distribution into the standard normal distribution by using the transformation formula: z = (x – mean) / (standard deviation).

• Hence, for x = 40, z = (40 – 50)/5 = –2.
• And for x = 65, z = (65 – 50)/5 = 3.
• Therefore, by using the normal distribution table, P(40 < x < 65) = P(–2 < z < 3) = .4772 + .4987 = .9759. Here .4772 is the tabulated area between z = 0 and z = 2 (equal, by symmetry, to the area between z = –2 and z = 0) and .4987 is the area between z = 0 and z = 3.
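The table lookup can be cross-checked in code. This sketch assumes the usual expression of the standard normal cumulative distribution through the error function, Φ(z) = ½(1 + erf(z / √2)); the function and variable names are chosen here for illustration.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


mean, sd = 50, 5
z_low = (40 - mean) / sd    # -2.0
z_high = (65 - mean) / sd   # 3.0

prob = standard_normal_cdf(z_high) - standard_normal_cdf(z_low)

print(z_low, z_high, round(prob, 4))  # -2.0 3.0 0.9759
```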

Sampling Theory

• A major concern of inferential statistics is to estimate unknown population characteristics by examining information gathered from a sub-collection of the population, called a sample.

• If it is to be used to study the characteristics of the population, a sample must be representative of the population.

• When a complete enumeration of a population, called a census, is not possible, a sample is used.

Random Sample

• Sampling bias, which is a statistical bias, can be avoided through randomization, a sampling process in which the sample is selected through an unbiased and impartial procedure.

• The sample generated from this process is called a random sample.

• Sampling techniques that rely on randomization include simple random sampling, stratified sampling, cluster sampling, and systematic sampling.
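The four techniques can be sketched with Python's standard `random` module. Everything concrete below is an assumption made for illustration: the population of 100 numbered units, the sample sizes, the two strata, and the ten clusters are all hypothetical.

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 numbered units
rng = random.Random(2024)         # fixed seed so the sketch is reproducible

# Simple random sampling: every unit has the same chance of being selected.
simple_sample = rng.sample(population, 10)

# Systematic sampling: pick a random start, then take every k-th unit.
k = len(population) // 10
start = rng.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: draw a random sample separately within each stratum.
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified_sample = [unit for units in strata.values() for unit in rng.sample(units, 5)]

# Cluster sampling: randomly choose whole clusters and keep every unit in them.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [unit for cluster in rng.sample(clusters, 2) for unit in cluster]

print(len(simple_sample), len(systematic_sample), len(stratified_sample), len(cluster_sample))
# 10 10 10 20
```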

Analysis of Variance

• Analysis of variance is a statistical method used to test the equality of two or more population means.

• Analysis of variance is a methodology for analyzing the variation between samples and the variation within samples using variances, rather than ranges.

• Analysis of variance enables us to test hypotheses such as:
– H0: μ1 = μ2 = μ3 = … = μk
– H1: At least two population means are unequal.
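As a rough illustration of comparing the variation between samples with the variation within samples, the sketch below computes the one-way F statistic for three small samples. The data are hypothetical and the function name `one_way_anova_f` is introduced here; the course material does not supply a worked ANOVA example.

```python
def one_way_anova_f(groups):
    """Return the F statistic: between-sample mean square over within-sample mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the sample means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own sample mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data, three samples of size 3
f_stat = one_way_anova_f(samples)

print(f_stat)  # 3.0
```

A large F relative to the tabulated critical value would lead to rejecting H0; with identical sample means the between-sample variation, and therefore F, would be zero.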

Linear Regression Analysis

• Linear regression analysis is a procedure for analyzing the relationship between two types of variables, where one variable is dependent upon one or more independent variables.

• A simple linear regression model involves only one dependent variable influenced by one independent variable. It is estimated using the least squares prediction equation as shown:

y = b0 + b1x

• A multiple regression model involves one dependent variable influenced by two or more independent variables:

y = b0 + b1x1 + b2x2 + b3x3 + . . . + bnxn

Estimation of Parameters

• In simple linear regression, the parameters b0 and b1 can be estimated by using the following formulas:

SSx = Σx² – (Σx)²/n
SSy = Σy² – (Σy)²/n
SSxy = Σxy – (Σx)(Σy)/n
ȳ = Σy/n
x̄ = Σx/n
b1 = SSxy / SSx
b0 = ȳ – b1x̄

• In multiple linear regression, the estimation of the parameters is considerably more complicated, but computer programs are available that can provide solutions to the problem.


Mileage ratings were released comparing engine size (in cubic inches of displacement, cdi) and miles per gallon (mpg) estimates for eight representative models of compact cars in the US, as shown:

Car         cdi    mpg
Cavalier    121    30
Stanza      120    31
Omni        97     34
Escort      98     27
Mazda       122    29
Horizon     97     34
Encore      85     38
Corolla     122    32

Requirement

• Suppose we want to find out whether a relationship exists between engine displacement and miles per gallon, on the assumption that engine displacement affects the mileage rating of a car.

• What relationship can be established?
• Hint: Use the least squares prediction equation to estimate the linear regression model.

Solution

• To find the relationship, we first establish the following:

y = mpg
x = cdi

• Given the above notation, the relationship will be described by the following equation using the predictor model:

y = b0 + b1x

• Having set up the model, we now find the parameters b0 and b1 by applying the formulas. These formulas can be applied only after we compute the following data.

        x      y      x²        y²       xy
        121    30     14,641    900      3,630
        120    31     14,400    961      3,720
        97     34     9,409     1,156    3,298
        98     27     9,604     729      2,646
        122    29     14,884    841      3,538
        97     34     9,409     1,156    3,298
        85     38     7,225     1,444    3,230
        122    32     14,884    1,024    3,904
Total   862    255    94,456    8,211    27,264

• SSx = Σx² – (Σx)²/n = 94,456 – (862)²/8 = 1575.5

• SSy = Σy² – (Σy)²/n = 8,211 – (255)²/8 = 82.875

• SSxy = Σxy – (Σx)(Σy)/n = 27,264 – (862)(255)/8 = –212.25

• b1 = SSxy / SSx = –212.25/1575.5 = –.1347

• b0 = ȳ – b1x̄ = Σy/n – (–.1347)(Σx/n) = (255/8) + (.1347)(862/8) = 46.3889

• Therefore, y = 46.3889 – .1347x
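The hand computation can be reproduced from the raw car data. The sketch below applies the same least squares formulas; the variable names are chosen here for illustration.

```python
# (car, cubic inches of displacement, miles per gallon) from the mileage table.
cars = [
    ("Cavalier", 121, 30), ("Stanza", 120, 31), ("Omni", 97, 34), ("Escort", 98, 27),
    ("Mazda", 122, 29), ("Horizon", 97, 34), ("Encore", 85, 38), ("Corolla", 122, 32),
]

x = [cdi for _, cdi, _ in cars]
y = [mpg for _, _, mpg in cars]
n = len(cars)

# Sums of squares and cross products, as in the formulas above.
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = ss_xy / ss_x                    # slope
b0 = sum(y) / n - b1 * sum(x) / n    # intercept

print(ss_x, ss_xy)   # 1575.5 -212.25
print(round(b1, 4))  # -0.1347
print(round(b0, 3))  # 46.391
```

Carrying b1 at full precision gives an intercept of about 46.391 rather than 46.3889; the small difference arises because the hand computation rounds b1 to –.1347 before computing b0.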

Analysis

• With y = 46.3889 – .1347x, we can now estimate the mileage rating of a car model given its engine displacement.

• However, the estimate is reliable only if there is a close correlation between the two variables considered. This is another aspect that statistics, through correlation analysis, has to establish before the model can be useful in predicting the characteristics of a variable. Commonly used to test correlation is Spearman's rank correlation coefficient, denoted by rs.
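A minimal sketch of Spearman's rank correlation for the car data follows. It assumes the common convention of giving tied values the average of their ranks and then computing the ordinary correlation coefficient on the ranks; the helper names `average_ranks`, `correlation`, and `spearman` are introduced here for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def correlation(a, b):
    """Ordinary (product-moment) correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


def spearman(a, b):
    """Spearman's rank correlation: the correlation coefficient of the ranks."""
    return correlation(average_ranks(a), average_ranks(b))


cdi = [121, 120, 97, 98, 122, 97, 85, 122]
mpg = [30, 31, 34, 27, 29, 34, 38, 32]
rs = spearman(cdi, mpg)

print(round(rs, 2))  # -0.65
```

A value near –0.65 points to a moderate negative association: in this sample, larger engines tend to go with lower mileage.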

Specification Criteria

1. Theory. Is there a theoretically sound justification for including the variable?
2. t-test. Is the variable's estimated coefficient statistically significant and of the expected sign?
3. Does the overall fit of the equation improve when the variable is added?
4. Bias. Does another variable's coefficient change significantly when the variable is added to the equation?

PROBLEMS

• Contemporaneous correlation between the independent variables and the disturbance term
• Heteroscedasticity
• Autocorrelation

Further Analysis

• Moreover, the appropriateness of the regression model must be tested further to establish that the model is good enough for predictive purposes. The F statistic is the tool used to test its appropriateness.

• In other words, as this particular model illustrates, there is more to statistics than what has been presented here. If you are interested in pursuing the subject, you are encouraged to acquire books that provide this information or to study the subject formally.
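For the car mileage model, the F statistic can be sketched from the sums of squares already computed in the worked example. The decomposition used here (regression sum of squares = SSxy²/SSx, error sum of squares = SSy minus that amount, with 1 and n – 2 degrees of freedom) is the standard one for simple linear regression and is supplied here as an assumption, since the course material does not spell it out.

```python
# Sums of squares from the worked car mileage example (n = 8 cars).
n = 8
ss_x, ss_y, ss_xy = 1575.5, 82.875, -212.25

ss_regression = ss_xy ** 2 / ss_x   # variation in y explained by the fitted line
ss_error = ss_y - ss_regression     # variation in y left unexplained

# One independent variable gives 1 and n - 2 degrees of freedom.
f_stat = (ss_regression / 1) / (ss_error / (n - 2))
r_squared = ss_regression / ss_y    # share of the variation in y that is explained

print(round(f_stat, 2), round(r_squared, 3))  # 3.16 0.345
```

For comparison, the tabulated 5% critical value of F with 1 and 6 degrees of freedom is about 5.99, so on these eight cars the fitted line would not pass the test at that level.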

Problem objectives (columns of the table):
– Description of a Single Population
– Comparison of Two Populations
– Comparison of Two or More Populations
– Analysis of the Relationship Between Two Variables
– Analysis of the Relationship Among Two or More Variables

Scales (rows of the table) and the techniques listed for each:

Nominal
– z test and estimator of p
– χ² test of the multinomial experiment
– z test and estimator of (p1 – p2)
– χ² test of a contingency table

Ordinal
– Wilcoxon rank sum test for independent samples and for matched pairs
– Kruskal-Wallis test for the completely randomized design
– Friedman test for the randomized block design
– Spearman rank correlation

Interval
– z test and estimator of μ
– t test and estimator of μ
– χ² test and estimator of variance
– z test and estimator of μ1 – μ2
– t test and estimator of μp = μ1 – μ2
– F test and estimator of the ratio of two variances, σ1²/σ2²
– ANOVA: completely randomized design
– ANOVA: randomized block design
– Simple linear regression and correlation
– Multiple regression analysis
