STATISTICS
PNP COMPTROLLERSHIP COURSE
Statistics
• The term has two meanings.
• Statistics (singular) is the science of collecting, organizing, analyzing, and interpreting information.
• Statistics (plural) are numbers calculated from a set or collection of information.
General Categories
• Descriptive Statistics comprises those methods used to organize and describe information that has been collected.
• Inferential Statistics involves the theory of probability and comprises those methods and techniques for making generalizations, predictions, or estimates about the population by using limited information.
Descriptive Statistics
Organizing Data
• Data are the building blocks of statistics.
• They are generally categorized as quantitative or qualitative.
• They are also classified according to the type of measurement scale used:
– Nominal scale
– Ordinal scale
– Interval scale
– Ratio scale
Nominal Scale
• A nominal scale exists for both quantitative and qualitative data.
– A nominal scale for quantitative data assigns numbers to categories to distinguish one from another, such as basketball jersey numbers, postal zip codes, and telephone numbers.
– A nominal scale for qualitative data is an unordered grouping of data into discrete categories where each datum can go into only one group, such as sex, blood type, or religion.
Ordinal Scale
• Data measured on a nominal scale that are ordered in some fashion are referred to as ordinal data.
– Letter grades such as A, B, C, D, and F
– Ranks such as Inspector, Sr Inspector, Chief Inspector
– Residence number
– Performance ratings such as Poor, Fair, Good
– Grades in school such as 1, 2, 3, and so on
Interval Scale
• Data measured on an ordinal scale for which distances between values can be calculated are called interval data.
• The distance between two values is relevant.
• Interval data are necessarily quantitative.
• An interval scale does not necessarily have a true zero point, that is, a point which indicates the absence of what we are measuring.
Example
• IQ test scores. We can say an IQ score of 180 is higher than an IQ score of 90. We can also say that it is 90 points higher. But we cannot say that a person with an IQ score of 180 is twice as smart as a person with an IQ score of 90. Likewise, a given difference between two IQ scores does not always have the same meaning. For instance, the differences 100 − 90 and 150 − 140 may have different interpretations even though both equal 10.
Another Example
• Celsius temperature. A temperature of 80 degrees C is 40 degrees warmer than a temperature of 40 degrees C. But it is not correct to say that 80 degrees C is twice as warm as 40 degrees C. Note that 0 degrees C does not represent the absence of heat or zero heat. The absence of heat is represented by 0 Kelvin, which is equivalent to about −273 degrees C.
Ratio Scale
• Data measured on an interval scale with a zero point meaning “none” are called ratio data. Because the zero point of the Celsius scale does not represent the absence of heat, the Celsius scale is not a ratio scale. The Kelvin scale is a ratio scale. Examples of other ratio scales are those commonly used to measure units such as feet, meters, pounds, and pesos. The results of counting objects are also ratio data.
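The Celsius versus Kelvin point can be checked with a little arithmetic. The sketch below is illustrative only; the helper name `celsius_to_kelvin` is an assumption, and it uses the more precise offset of 273.15 rather than the rounded 273 quoted above.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin (0 K is about -273.15 degrees C)."""
    return celsius + 273.15

# On the Celsius (interval) scale the ratio looks like "twice as warm" ...
ratio_celsius = 80 / 40

# ... but on the Kelvin (ratio) scale the same temperatures differ by far less.
ratio_kelvin = celsius_to_kelvin(80) / celsius_to_kelvin(40)

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 3))  # 1.128
```

Only the Kelvin ratio is physically meaningful, which is why ratios should be formed only from ratio-scale data.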
Organizing Data Using Tables
• The objective of organizing data is to arrange a set of data into a useful form in order to reveal essential features and simplify certain analyses.
• Data that are not organized in some fashion are called raw data.
• One method of arranging data is to construct an ordered array, that is, arranging the data from low to high (or high to low).
• If the number of data is large, the data may be difficult to manage; thus tables are often used as a general approach to organizing raw data.
Ungrouped Frequency Tables
• The frequency of a measurement or category is the total number of times the measurement or category occurs in a collection of data. The symbol f is used to denote the frequency of a measurement.
• For example: sample data representing the number of free throws missed by a basketball team during the last 7 games:
7   2   8   4   2   7   2
Frequency Table of Free Throw Data
Data x    f
2         3
4         1
7         2
8         1
Total     7
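The free throw table can be reproduced in a few lines. This is a minimal sketch, assuming Python and its standard `collections.Counter`; the variable names `missed` and `freq` are illustrative.

```python
from collections import Counter

# Free throws missed in the last 7 games (data from the example above).
missed = [7, 2, 8, 4, 2, 7, 2]

# Counter tallies how many times each measurement x occurs.
freq = Counter(missed)

# Print the table in ascending order of x, followed by the total.
for x in sorted(freq):
    print(x, freq[x])
print("Total", sum(freq.values()))
```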
Tally Marks
• For a very large number of data, an intermediate step is to count observations through the use of tally marks to aid in determining the frequency f for each observation.
• Corresponding to each observation we place a tally mark in a tally column.
• After all tallies are placed, they are counted for each measurement x to determine the frequency.
Example
9   8   7   8   4   3
2   1   0   5   3   2
1   1   7   3   2   8
7   6   6   4   3   2
2   0   9   4   6   9
6   9   4   3   5   7
3   2   1   4   4   2
x       Frequency f
0       2
1       4
2       7
3       6
4       6
5       2
6       4
7       4
8       3
9       4
Total   42
Grouped Frequency Tables
• A grouped frequency table shows frequencies according to groups or classes of measurements.
• For example, a memorial hospital wants to study whether its emergency room staffing is adequate. To start the study, the manager records the number of people visiting the emergency room each day for a 12-day period, with the following results:
7   43   8   22   13   28   36   18   23   21   15   52
Steps
• The manager constructs six groupings or classes: the first class representing 1-10 patients; the second class, 11-20 patients; the 3rd class, 21-30 patients; the 4th class, 31-40; the 5th class, 41-50; and the 6th class, 51-60.
• For the first class, the lower class limit is 1 while the upper class limit is 10. The rest of the classes will have a similar pattern of lower and upper limits.
• Tally the number of patients that fall within each class.
• Construct the grouped frequency table.
Grouped Frequency Table for Emergency Room Data
Class     Frequency f
1-10      2
11-20     3
21-30     4
31-40     1
41-50     1
51-60     1
Total     12
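The grouping steps for the emergency room data can be sketched as follows. This is an illustration rather than a prescribed procedure; the names `visits`, `classes`, and `grouped` are assumptions.

```python
# Daily emergency room visits over the 12-day period (from the example).
visits = [7, 43, 8, 22, 13, 28, 36, 18, 23, 21, 15, 52]

# Class limits chosen by the manager: 1-10, 11-20, ..., 51-60.
classes = [(lower, lower + 9) for lower in range(1, 61, 10)]

# Count the observations falling within each pair of class limits.
grouped = {
    (lo, hi): sum(1 for v in visits if lo <= v <= hi)
    for lo, hi in classes
}

for (lo, hi), f in grouped.items():
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(grouped.values()))
```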
Basic Guidelines
• Each class should have the same width.
• No two classes should overlap.
• Each piece of data should belong to a class.
Class Boundaries and Class Widths
• Class boundaries determine class widths.
• Class boundaries for a grouped frequency table are determined by considering the unit or precision of measurement.
• The lower class boundary of a class interval is located one-half unit below the lower class limit. The upper class boundary is one-half unit above the upper class limit.
• The class width $w$ for any class interval is found by subtracting the lower class boundary from the upper class boundary, thus:
$w = l_2 - l_1$
where: $l_1$ is the lower class boundary; and,
$l_2$ is the upper class boundary for each class interval.
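As a quick illustration of the boundary and width rules, the sketch below applies them to the class 11-20 from the emergency room table. The function names and the `unit` parameter (the precision of measurement, here 1 patient) are assumptions made for the example.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (l1, l2): boundaries half a unit outside the class limits."""
    half = unit / 2
    return lower_limit - half, upper_limit + half

def class_width(lower_limit, upper_limit, unit=1):
    """Width w = l2 - l1, computed from the class boundaries."""
    l1, l2 = class_boundaries(lower_limit, upper_limit, unit)
    return l2 - l1

print(class_boundaries(11, 20))  # (10.5, 20.5)
print(class_width(11, 20))       # 10.0
```

Note that the width is 10, not 20 − 11 = 9, because the boundaries extend half a unit beyond each limit.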
Basic Rules in Constructing a Grouped Frequency Table
• How many classes should be used?
• What should be the width of each class?
• At what value should the first class start?
• How is the class mark or midpoint computed?
Basic Rules in Constructing a Grouped Frequency Table
• For the number of classes $c$, use Sturges' rule, where $n$ is the number of observations and the logarithm is to base 10:
$c = 3.3 \log n + 1$
• For width, the rule is $w = R / c$, where $R$ is the range computed by subtracting the smallest measurement $L$ from the largest measurement $U$; thus, $R = U - L$.
• The lower limit of the first class should be near and at most as large as the smallest measurement $L$.
• The class mark $X$, or midpoint, is computed by adding the lower class limit $a$ and upper class limit $b$ and dividing the sum by 2; thus, $X = (a + b) / 2$.
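The rules above can be sketched in code. The rounding choices here (nearest integer for the number of classes, rounding up for the width) are assumptions, since the rules themselves do not say how to round; the function names are illustrative.

```python
import math

def sturges_classes(n):
    """Sturges' rule c = 3.3 * log10(n) + 1, rounded to the nearest integer."""
    return round(3.3 * math.log10(n) + 1)

def width_from_range(smallest, largest, c):
    """Width w = R / c with R = U - L, rounded up so all data are covered."""
    return math.ceil((largest - smallest) / c)

def class_mark(a, b):
    """Midpoint of a class with lower limit a and upper limit b."""
    return (a + b) / 2

print(sturges_classes(50))          # 7
print(width_from_range(2, 98, 5))   # 20
print(class_mark(2, 21))            # 11.5
```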
Relative Frequency Table
• It is sometimes useful to express each value or class in a frequency table as a fraction of the total observations.
• The relative frequency of a class is found by dividing the frequency f by the total number of observations n.
• The table that describes the relative frequencies is then called a relative frequency table.
Cumulative Frequency Table
• There are many occasions when we are interested in the number of observations less than or equal to some value. Example: A teacher may want to know the number of students who got a score of less than or equal to 70% on an examination. The cumulative frequency will answer that.
• The cumulative frequency for any measurement or class is the total of the frequency for that measurement or class and the frequencies of all measurements or classes of smaller value.
Cumulative Relative Frequency Table
• Cumulative frequency tables can also be constructed for tables containing relative frequencies or percentages.
• The procedures are identical to those used for cumulative frequency tables except that relative frequencies or percentages are used.
• Cumulative relative frequencies have many uses. One is in scoring standardized tests through the percentiles method. A percentile score tells what part of the tested population scored lower. For example, if 50 is said to be the 90th percentile in an examination, it means that 90% of the scores were lower than 50.
Example
• A final examination produced the following 50 scores.
17   15   78   21   10   32    7   65   18   87
 4   22   34   42    9    9   82   79   98    4
44   64   62   77    2   81   45   37   83   44
77   13   41   16   17   13   82   37    5   54
 7   67   88   41   61   22   92   16   67   85
• In constructing the frequency table, assume c = 5. The range is R = 98 − 2 = 96, so w = 96 / 5 = 19.2, which is rounded up to a class width of 20.
Grouped Frequency Table
Class Number   Class     X      f
1              2-21      11.5   18
2              22-41     31.5   8
3              42-61     51.5   6
4              62-81     71.5   10
5              82-101    91.5   8
Relative Frequency Table
Class     f     Relative Frequency
2-21      18    .36
22-41     8     .16
42-61     6     .12
62-81     10    .20
82-101    8     .16
Total     50
Cumulative Frequency Table
Class Number   Class     Cumulative Frequency
1              2-21      18
2              22-41     26
3              42-61     32
4              62-81     42
5              82-101    50
Cumulative Relative Frequency Table
Class     f     Relative Frequency   Cumulative Relative Frequency
2-21      18    .36                  .36
22-41     8     .16                  .52
42-61     6     .12                  .64
62-81     10    .20                  .84
82-101    8     .16                  1.00
Total     50
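The four tables for the examination scores can be derived from the raw data in one pass. The sketch below is one possible way to do it, assuming the five classes used in the example; the variable names are illustrative.

```python
from itertools import accumulate

# The 50 final examination scores from the example.
scores = [
    17, 15, 78, 21, 10, 32, 7, 65, 18, 87,
    4, 22, 34, 42, 9, 9, 82, 79, 98, 4,
    44, 64, 62, 77, 2, 81, 45, 37, 83, 44,
    77, 13, 41, 16, 17, 13, 82, 37, 5, 54,
    7, 67, 88, 41, 61, 22, 92, 16, 67, 85,
]
classes = [(2, 21), (22, 41), (42, 61), (62, 81), (82, 101)]

# Frequency f of each class, then the derived columns.
freqs = [sum(1 for s in scores if lo <= s <= hi) for lo, hi in classes]
n = sum(freqs)
rel = [f / n for f in freqs]        # relative frequency f / n
cum = list(accumulate(freqs))       # cumulative frequency
cum_rel = [c / n for c in cum]      # cumulative relative frequency

for (lo, hi), f, r, c, cr in zip(classes, freqs, rel, cum, cum_rel):
    print(f"{lo}-{hi}: f={f} rel={r:.2f} cum={c} cum_rel={cr:.2f}")
```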
Graphical Representation of Data
[Figure: A Bar Graph. Series East, West, and North plotted for the 1st through 4th quarters.]
[Figure: A Pie Graph. Slices for the 1st through 4th quarters.]
[Figure: A Line Graph. Series East, West, and North plotted for the 1st through 4th quarters.]
Measures of Central Tendencies
• The first characteristic of a set of data that we want to measure is the center or central tendency. The purpose is to summarize a collection of data to obtain a general overview that will serve as a representative for the rest of the data.
• Common Measures of Central Tendencies:
– Mean
– Median
– Mode
– Midrange
Mean
• The mean or arithmetic average is found by adding the numbers and then dividing the sum by the number of observations $n$:
$\bar{x} = \sum x / n$
• A population mean is denoted by:
$\mu = \sum x / N$
• The mean for grouped data, where $X$ is the class mark and $f$ the class frequency:
$\bar{x} = \sum (f X) / \sum f$
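Both mean formulas can be tried on data that already appear in this course: the free throw data for the simple mean and the class marks and frequencies of the examination example for the grouped mean. This is a sketch with illustrative function names.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

def grouped_mean(class_marks, freqs):
    """Mean for grouped data: sum(f * X) / sum(f)."""
    return sum(f * x for x, f in zip(class_marks, freqs)) / sum(freqs)

free_throws = [7, 2, 8, 4, 2, 7, 2]
marks = [11.5, 31.5, 51.5, 71.5, 91.5]
freqs = [18, 8, 6, 10, 8]

print(mean(free_throws))           # 32 / 7, about 4.57
print(grouped_mean(marks, freqs))  # 2215 / 50 = 44.3
```

The grouped mean is an approximation, because every score in a class is treated as if it equaled the class mark.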
Disadvantage of the Mean
• The mean as a measure of center has a disadvantage. It is affected by the extreme measurements on one end of a distribution. It depends on the value of every measurement, and extreme values can lead to the mean misrepresenting the data.
• In this case, the median may provide a better measure than the mean inasmuch as it is not affected by the extreme values.
Median
• In general, the median is found by first ranking the data.
• If there is an odd number of observations, then the median is the number in the middle of the distribution.
• If the number of observations is even, then the median is computed by adding the two numbers found in the middle positions and dividing the sum by 2.
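The odd and even cases can be sketched as below, using the free throw data (7 observations) and the emergency room data (12 observations) from earlier sections. The function name `median` is illustrative.

```python
def median(values):
    """Rank the data, then take the middle value (odd n) or the
    average of the two middle values (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([7, 2, 8, 4, 2, 7, 2]))                             # 4
print(median([7, 43, 8, 22, 13, 28, 36, 18, 23, 21, 15, 52]))    # 21.5
```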
Mode
• The mode, if it exists, is the most frequent measurement or observation.
• The mode has the advantage of being easily found, especially in small samples, and is usually not influenced by extreme measurements on one end of an ordered set of data.
• Example: In an array of data arranged as follows: 1, 2, 3, 3, 3, 4, 5, and 6, the mode is 3.
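A short sketch of finding the mode follows. It assumes one common convention, namely that a data set in which no value repeats has no mode, and it returns every value tied for the highest frequency; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # assumption: no repeated value means no mode
    return sorted(x for x, c in counts.items() if c == top)

print(modes([1, 2, 3, 3, 3, 4, 5, 6]))  # [3]
print(modes([1, 2, 3]))                 # []
```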
Relationships Between Mean, Median and Mode
[Figure: three distribution curves with the positions of the mean, median, and mode marked.]
• Rightward skewness: Mean > Median, Median > Mode
• Leftward skewness: Mean < Median, Median < Mode
• Symmetry: Mean = Median, Median = Mode
Midrange
• The midrange of a set of data is the average of the largest measurement $U$ and the smallest measurement $L$, thus:
$\text{Midrange} = (U + L) / 2$
• For data organized in a grouped frequency table, the midrange is approximately the average of the lower class boundary of the first class ($l_{1,fc}$) and the upper class boundary of the last class ($l_{2,lc}$), thus:
$\text{Midrange} = (l_{1,fc} + l_{2,lc}) / 2$
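Both midrange formulas can be compared on the emergency room data. In this sketch the function names and the `unit` parameter (precision of measurement, assumed to be 1) are illustrative assumptions.

```python
def midrange(values):
    """Average of the largest and smallest measurements."""
    return (max(values) + min(values)) / 2

def grouped_midrange(first_lower_limit, last_upper_limit, unit=1):
    """Average of the first class's lower boundary and the last
    class's upper boundary (boundaries lie half a unit past the limits)."""
    l1_fc = first_lower_limit - unit / 2
    l2_lc = last_upper_limit + unit / 2
    return (l1_fc + l2_lc) / 2

visits = [7, 43, 8, 22, 13, 28, 36, 18, 23, 21, 15, 52]
print(midrange(visits))         # 29.5
print(grouped_midrange(1, 60))  # 30.5, from classes 1-10 through 51-60
```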
Measures of Dispersion or Variability
• Quite often, measures of central tendency alone do not adequately describe a characteristic being observed.
• Hence, variability is an important concept in statistics. As a result, there are many measures of variability for a collection of quantitative data such as:
– Range
– Variance
– Standard deviation
– Standard score
Range
• As previously defined, range is the difference between the largest and the smallest measurements; thus:
$R = U - L$
where: $R$ is the range
$L$ is the smallest measurement
$U$ is the largest measurement
Deviation Score
• Deviation score is the quantity defined by this relationship:
$x - \bar{x}$
• Deviation score represents the directed distance a measurement has from the mean of a set of data.
• A positive deviation score means the measurement is above the mean; a negative score means the mean is above the measurement; a zero deviation means the measurement is equal to the mean.
Sum of Squares
• Adding the deviation scores always gives zero, a useless result for analyzing a set of data. To avoid this situation the sum of squares is used.
• Sum of squares SS is computed by first squaring the deviation scores, then adding them up; thus:
$SS = \sum (x - \bar{x})^2$
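The claim that deviation scores sum to zero, and the sum of squares that replaces them, can be checked numerically. This sketch reuses the free throw data; the variable names are illustrative, and the sum of deviations is zero only up to floating-point rounding.

```python
data = [7, 2, 8, 4, 2, 7, 2]
x_bar = sum(data) / len(data)

# Deviation score of each measurement: x minus the mean.
deviations = [x - x_bar for x in data]

# Positive and negative deviations cancel out ...
print(sum(deviations))   # approximately 0

# ... so the deviations are squared before being added.
ss = sum(d ** 2 for d in deviations)
print(ss)                # about 43.71
```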
Variance
• The variance of a population of measurements is defined as the average of the squared deviation scores, denoted here by $\delta^2$ (conventionally written $\sigma^2$); thus:
$\delta^2 = SS / N$
• The variance of a sample, denoted by $s^2$, is defined by the following formula:
$s^2 = SS / (n - 1)$
• The variance for data in frequency tables is computed by first deriving the sum of squares using the following formula, then proceeding as above:
$SS = \sum (f x^2) - (\sum f x)^2 / \sum f$
where $x$ is the class mark (midpoint) of the class.
Variance
• Thus, the variance for grouped data of a population is:
$\delta^2 = SS / \sum f$
• The variance for grouped data of a sample, denoted by $s^2$, is defined by the following formula:
$s^2 = SS / (\sum f - 1)$
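The variance formulas can be sketched as follows, with the grouped version applied to the class marks and frequencies of the examination example. Function and variable names are illustrative assumptions.

```python
def sum_of_squares(values):
    """SS = sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def population_variance(values):
    """SS / N."""
    return sum_of_squares(values) / len(values)

def sample_variance(values):
    """SS / (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)

def grouped_sum_of_squares(class_marks, freqs):
    """SS = sum(f * x^2) - (sum(f * x))^2 / sum(f), with x the class mark."""
    sum_fx2 = sum(f * x ** 2 for x, f in zip(class_marks, freqs))
    sum_fx = sum(f * x for x, f in zip(class_marks, freqs))
    return sum_fx2 - sum_fx ** 2 / sum(freqs)

marks = [11.5, 31.5, 51.5, 71.5, 91.5]
freqs = [18, 8, 6, 10, 8]
ss = grouped_sum_of_squares(marks, freqs)
print(ss)                      # 46208.0
print(ss / sum(freqs))         # population variance for the grouped data
print(ss / (sum(freqs) - 1))   # sample variance for the grouped data
```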
Standard Deviation
• The standard deviation is defined as the positive square root of the variance.
• The standard deviation of a population is denoted by $\delta$. The standard deviation of the sample is denoted by $s$; thus:
$\delta = \sqrt{\delta^2}$
$s = \sqrt{s^2}$
• If the standard deviation of the population is known, the standard deviation of the mean of a sample of size $n$ (the standard error) is derived from the following formula:
$\delta_{\bar{x}} = \delta / \sqrt{n}$
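The square-root relationship between variance and standard deviation can be confirmed with Python's standard `statistics` module, whose `pvariance`/`pstdev` functions use the population formulas and whose `variance`/`stdev` functions use the sample formulas. The free throw data serve as an illustrative input.

```python
import math
import statistics

data = [7, 2, 8, 4, 2, 7, 2]

# Population standard deviation: square root of SS / N.
pop_sd = math.sqrt(statistics.pvariance(data))

# Sample standard deviation: square root of SS / (n - 1).
sample_sd = math.sqrt(statistics.variance(data))

# The library's own standard deviation functions agree.
print(pop_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```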
Standard Score
• A measure that takes into account the dispersion of the scores is called a standard score.
• Standard scores also allow analysts to compare measurements from different distributions, thus giving them the ability to decide on ranking.
• A standard score, denoted as $z$, is defined as:
Standard score = Deviation Score / Standard Deviation
$z = (x - \mu) / \delta$
Example Problem
• Consider this problem. Pedro scores 700 on the math portion of a national test A. Pablo scores 24 on another national test B. The mean and the standard deviation of test A are 500 and 100 respectively, while those of test B are 18 and 6. If both tests are regarded as measures of the same kind of ability, which person is better?
• To answer the question we need some method of comparing scores from different distributions.
Answer
• Using deviation scores, where Pedro has 200 (700 − 500) and Pablo has 6 (24 − 18), we cannot compare the two because the spread of the scores is not taken into account.
• Using $z$, we can now compare:
For Pedro, $z = (700 - 500) / 100 = 2$
For Pablo, $z = (24 - 18) / 6 = 1$
• Pedro's score lies 2 standard deviations above the mean of his test while Pablo's lies only 1 above his, so Pedro is better than Pablo.
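The comparison of Pedro and Pablo reduces to two standard score calculations, sketched below with an illustrative `z_score` helper.

```python
def z_score(x, mean, std_dev):
    """Standard score: deviation score divided by the standard deviation."""
    return (x - mean) / std_dev

pedro = z_score(700, mean=500, std_dev=100)
pablo = z_score(24, mean=18, std_dev=6)

print(pedro)          # 2.0
print(pablo)          # 1.0
print(pedro > pablo)  # True
```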
Inferential Statistics
Knowledge Requirements
– The Concept of Probability
– The Theorem on Counting
– Binomial Distributions
– Normal Distributions
– Sampling Theory
– Analysis of Variance (ANOVA)
– Linear Regression Analysis
The Concept of Probability
• Probability provides the foundation of inferential statistics.
• Using probability theory, we can deduce the likelihood of certain samples occurring with specified properties, which will enable us to draw inferences about the population.
• The probability of an event is a number between 0 and 1.
• If $E$ is an event, then $P(E)$ denotes the probability of $E$.
The Concept of Probability
• Probability satisfies the following properties, where the $E_i$ are the mutually exclusive outcomes of an experiment:
• $P(E_i) \ge 0$
• $P(E_i) \le 1$
• $\sum P(E_i) = 1$
• We can assign probabilities to events through the use of experiments or empirical observation. Through the latter, the probability is called empirical probability and it is computed using the following formula, where $f$ is the number of times the event was observed and $n$ is the total number of observations:
$P(E) = f / n$
Example Problem 1
• An insurance company wants to estimate the probability that a police car will be involved in an accident. Last month, 7 of the 20 insured police cars were involved in accidents. What is the estimated probability? What is the chance that a police car will not be involved in an accident?
• Answers:
a. P(E) = f / n = 7/20 = .35
b. 1 – P(E) = 1 – .35 = .65
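The two answers can be checked with a few lines of Python. This is only a minimal sketch: the function name `empirical_probability` and its input checks are illustrative choices, not part of the course material.

```python
def empirical_probability(f: int, n: int) -> float:
    """Return the empirical probability f / n of an event.

    f is the number of times the event was observed and n is the
    total number of observations.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= f <= n:
        raise ValueError("f must be between 0 and n")
    return f / n


# Example Problem 1: 7 of the 20 insured police cars had accidents.
p_accident = empirical_probability(7, 20)
p_no_accident = 1 - p_accident  # complement rule

print(f"P(accident)    = {p_accident:.2f}")
print(f"P(no accident) = {p_no_accident:.2f}")
```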
Example Problem 2
Entrance test math scores for a large university are grouped as shown:

| Scores | f |
|---|---|
| 200-299 | 3,600 |
| 300-399 | 11,900 |
| 400-499 | 12,000 |
| 500-599 | 5,500 |
| 600-699 | 1,500 |
| 700-799 | 500 |

• If a student is selected at random, what is the probability that the student's math score:
– Exceeds 399?
– Is at most 599?
– Is between 600 and 699?
– Is not between 400 and 499?
– Is less than or equal to 699?
• Hint: Construct the relative frequency table; the answers can be read from it.
Requirements

| Scores | f | Relative f |
|---|---|---|
| 200-299 | 3,600 | .103 |
| 300-399 | 11,900 | .340 |
| 400-499 | 12,000 | .343 |
| 500-599 | 5,500 | .157 |
| 600-699 | 1,500 | .043 |
| 700-799 | 500 | .014 |
| Total | 35,000 | 1.000 |
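As a rough sketch of how the relative frequency table answers the five questions, the following Python snippet rebuilds the table and adds up the relevant rows. The dictionary layout and the helper name `probability` are assumptions made for illustration.

```python
# Grouped entrance-test math scores: (low, high) -> frequency f
frequencies = {
    (200, 299): 3_600,
    (300, 399): 11_900,
    (400, 499): 12_000,
    (500, 599): 5_500,
    (600, 699): 1_500,
    (700, 799): 500,
}
n = sum(frequencies.values())

# Relative frequency of each score interval: f / n
relative = {interval: f / n for interval, f in frequencies.items()}


def probability(condition) -> float:
    """Add up the relative frequencies of the intervals that meet condition."""
    return sum(rf for (low, high), rf in relative.items() if condition(low, high))


answers = {
    "exceeds 399": probability(lambda low, high: low > 399),
    "is at most 599": probability(lambda low, high: high <= 599),
    "is between 600 and 699": probability(lambda low, high: low >= 600 and high <= 699),
    "is not between 400 and 499": probability(lambda low, high: not (low >= 400 and high <= 499)),
    "is at most 699": probability(lambda low, high: high <= 699),
}

for (low, high), rf in relative.items():
    print(f"{low}-{high}: {rf:.3f}")
for question, p in answers.items():
    print(f"P(score {question}) = {p:.3f}")
```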
The Theorem on Counting
• The fundamental theorem on counting (FTC) states that:
If an event can occur in any one of m ways, and, after it has occurred, a second event can occur in any one of n ways, then the events can occur together, in the order stated, in mn different ways.
• The FTC is best illustrated with the aid of a decision tree that lays out the decisions to be made.
Example Problem
[Figure: decision tree in which A branches to three B nodes and each B branches to two C nodes]
There are two decisions to be made:
1. At A, which road to B?
2. At B, which road to C?
Question: How many ways are there to go from A to C?
Hint: At A, there are 3 ways. At B, there are 2 ways. Use the FTC.
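One way to see the FTC at work is to list every route explicitly. The sketch below does this with `itertools.product`; the road labels are hypothetical names invented for the illustration.

```python
from itertools import product

# Hypothetical labels for the roads in the decision tree.
roads_a_to_b = ["AB-1", "AB-2", "AB-3"]  # 3 ways to leave A
roads_b_to_c = ["BC-1", "BC-2"]          # 2 ways to leave B

# Every route is one choice at A followed by one choice at B.
routes = list(product(roads_a_to_b, roads_b_to_c))

for first_leg, second_leg in routes:
    print(first_leg, "then", second_leg)

# The FTC predicts m * n routes.
print("Number of routes:", len(routes))
```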
Binomial Distributions
• The binomial distribution describes the probability distribution of the number of successes, and hence failures, in a fixed number of trials.
• In a binomial experiment, the following properties are known:
– The experiment consists of n identical trials.
– Each trial results in exactly one of two outcomes, called success or failure. Success is denoted by S and failure by F.
– The n trials are independent.
– The probability of success, p, remains constant from trial to trial.
• A table for the binomial distribution can be used to find values of P(x) for selected values of p and n.
Example Problem
• If a baseball player with a batting average of .600 comes to bat five times in a game, what is the probability that he will get three hits?
• Given:
– p = .600
– n = 5
– x = 3
• Use the binomial table.
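Where no binomial table is at hand, the same value can be computed directly. The sketch below assumes the standard binomial formula $P(x) = \binom{n}{x} p^x (1-p)^{n-x}$, which the slides do not state, and reads "three hits" as exactly three hits; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example: p = .600, n = 5 at-bats, x = 3 hits.
p_three_hits = binomial_pmf(3, 5, 0.600)
print(f"P(x = 3) = {p_three_hits:.4f}")
```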
Normal Distribution
A normal distribution is mound shaped, or bell shaped, as shown.
[Figure: bell-shaped curve of y against x, centered at μ]
$$y = \frac{e^{-(x-\mu)^2/(2\delta^2)}}{\delta\sqrt{2\pi}}$$
where:
μ = mean of the population
δ = standard deviation
e ≈ 2.718 (the base of the natural logarithm)
x = any real number
Properties
• A normal distribution is mound or bell shaped.
• The area under the curve is always equal to 1.
• The mean is located at the center of the distribution, and the curve is symmetrical about its mean.
• The mean, median, and mode are equal.
• The curve extends indefinitely to the left and right of the mean and approaches the horizontal axis.
• The curve never touches the horizontal axis.
• The shape and position of the curve depend on the parameters μ and δ.
Empirical Rule
• Approximately 68% of the measurements fall within 1 standard deviation of the mean, that is, within $\mu \pm \delta$.
• Approximately 95% of the measurements fall within 2 standard deviations of the mean, that is, within $\mu \pm 2\delta$.
• Approximately 99.7% of the measurements fall within 3 standard deviations of the mean, that is, within $\mu \pm 3\delta$.
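The three percentages can be checked numerically. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def within(k: float) -> float:
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Share of measurements within 1, 2, and 3 standard deviations of the mean.
coverage = {k: within(k) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(f"within {k} standard deviation(s): {area:.1%}")
```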
Example Problem
[Figure: normal curve with 40, 50, and 65 marked on the x-axis]
Problem: For $\mu = 50$ and $\delta = 5$, find $P(40 < x < 65)$.
Solution
• Transform the distribution into a standard normal distribution by using the transformation formula $z = (x - \mu) / \delta$.
• Hence, for $x = 40$, $z = (40 - 50)/5 = -2$.
• And for $x = 65$, $z = (65 - 50)/5 = 3$.
• Therefore, by using the normal distribution table, $P(40 < x < 65) = P(-2 < z < 3) = .4772 + .4987 = .9759$.
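The table lookup can be cross-checked in Python. This is a small sketch using `statistics.NormalDist` (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

mu, delta = 50, 5  # population mean and standard deviation

# Transform the endpoints to z scores: z = (x - mu) / delta
z_low = (40 - mu) / delta
z_high = (65 - mu) / delta

# Area under the standard normal curve between the two z scores.
standard_normal = NormalDist()
probability = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"z_low = {z_low}, z_high = {z_high}")
print(f"P(40 < x < 65) = {probability:.4f}")
```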
Sampling Theory
• A major concern of inferential statistics is to estimate unknown population characteristics by examining information gathered from a subcollection of the population called a sample.
• To be used to study the characteristics of a population, a sample must be representative of that population.
• When a complete enumeration of a population, called a census, is not possible, a sample is used.
Random Sample
• Sampling bias, a type of statistical bias, can be removed through randomization, a sampling process in which the sample is selected through an unbiased and impartial procedure.
• The sample generated from this process is called a random sample.
• Sampling techniques that use randomization include simple random sampling, stratified sampling, cluster sampling, and systematic sampling.
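The four techniques named above can be sketched with Python's `random` module. Everything in this example is hypothetical: the population of 100 numbered units, the two strata, the clusters of ten, and the 10% sampling fraction are assumptions chosen only to keep the illustration small.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical unit IDs 1..100

# Simple random sampling: every unit has the same chance of selection.
simple_sample = rng.sample(population, 10)

# Systematic sampling: random start, then every k-th unit.
k = len(population) // 10
start = rng.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: draw 10% at random from each (hypothetical) stratum.
strata = {"stratum A": population[:60], "stratum B": population[60:]}
stratified_sample = [
    unit
    for units in strata.values()
    for unit in rng.sample(units, len(units) // 10)
]

# Cluster sampling: pick whole clusters at random and keep all their units.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [unit for cluster in rng.sample(clusters, 2) for unit in cluster]

print("simple:    ", sorted(simple_sample))
print("systematic:", systematic_sample)
print("stratified:", sorted(stratified_sample))
print("cluster:   ", sorted(cluster_sample))
```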
Analysis of Variance
• Analysis of variance is a statistical method used to test the equality of two or more population means.
• Analysis of variance is a methodology for analyzing the variation between samples and the variation within samples using variances, rather than ranges.
• Analysis of variance enables us to test hypotheses such as:
– $H_0$: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
– $H_1$: At least two population means are unequal.
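To make the comparison of between-sample and within-sample variation concrete, the sketch below computes the one-way F statistic for three small samples. The data are invented for illustration, and the formula used (mean square between samples divided by mean square within samples) is the standard one-way ANOVA statistic, which the slides do not spell out.

```python
from statistics import mean


def one_way_f(groups):
    """F statistic for a one-way (completely randomized) ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation between samples: how far each sample mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within samples: how far each value is from its own sample mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical samples from three populations.
samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_statistic = one_way_f(samples)
print(f"F = {f_statistic:.2f}")
```

The computed F would then be compared with a critical value from an F table with k – 1 and N – k degrees of freedom to decide whether to reject $H_0$.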
Linear Regression Analysis
• Linear regression analysis is a procedure for analyzing the relationship between two types of variables, where one variable is dependent upon one or more independent variables.
• A simple linear regression model involves only one dependent variable influenced by one independent variable. It is estimated using the least squares prediction equation:
$y = b_0 + b_1 x$
• A multiple regression model involves one dependent variable influenced by two or more independent variables:
$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$
Estimation of Parameters
• In simple linear regression, the parameters $b_0$ and $b_1$ can be estimated by using the following formulas:
$SS_x = \sum x^2 - (\sum x)^2 / n$
$SS_y = \sum y^2 - (\sum y)^2 / n$
$SS_{xy} = \sum xy - (\sum x \sum y) / n$
$\bar{y} = \sum y / n$, $\bar{x} = \sum x / n$
$b_1 = SS_{xy} / SS_x$
$b_0 = \bar{y} - b_1 \bar{x}$
• In multiple linear regression, the estimation of the parameters is considerably more complicated, but computer programs are available that can solve the problem.
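The simple linear regression formulas translate almost line for line into code. The following is a minimal sketch; the function name `fit_line` and the four toy data points (which lie exactly on y = 1 + 2x) are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Estimate b0 and b1 of y = b0 + b1*x by least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")

    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x**2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if ss_x == 0:
        raise ValueError("all x values are identical; slope is undefined")

    b1 = ss_xy / ss_x                      # slope
    b0 = sum_y / n - b1 * (sum_x / n)      # intercept: y-bar - b1 * x-bar
    return b0, b1


# Toy data lying exactly on the line y = 1 + 2x.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {b0:.1f} + {b1:.1f}x")
```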
Example Problem
Mileage ratings were released comparing engine size (in cubic inches of displacement, cdi) and miles per gallon (mpg) estimates for eight representative compact car models in the US, as shown:

| Car | cdi | mpg |
|---|---|---|
| Cavalier | 121 | 30 |
| Stanza | 120 | 31 |
| Omni | 97 | 34 |
| Escort | 98 | 27 |
| Mazda | 122 | 29 |
| Horizon | 97 | 34 |
| Encore | 85 | 38 |
| Corolla | 122 | 32 |
Requirement
• Suppose we want to find out whether there is a relationship between engine displacement and miles per gallon, where engine displacement affects the mileage rating of the car.
• What relationship can be established?
• Hint: Use the least squares prediction equation to estimate the linear regression model.
Solution
• To find the relationship, we first define the following:
y = mpg
x = cdi
• Given the above notation, the relationship is described by the prediction equation:
$y = b_0 + b_1 x$
• With the form of the relationship established, we now find the parameters $b_0$ and $b_1$ by applying the formulas. However, the formulas can be applied only after we compute the following data.
| $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 121 | 30 | 14,641 | 900 | 3,630 |
| 120 | 31 | 14,400 | 961 | 3,720 |
| 97 | 34 | 9,409 | 1,156 | 3,298 |
| 98 | 27 | 9,604 | 729 | 2,646 |
| 122 | 29 | 14,884 | 841 | 3,538 |
| 97 | 34 | 9,409 | 1,156 | 3,298 |
| 85 | 38 | 7,225 | 1,444 | 3,230 |
| 122 | 32 | 14,884 | 1,024 | 3,904 |
| $\sum x$ = 862 | $\sum y$ = 255 | $\sum x^2$ = 94,456 | $\sum y^2$ = 8,211 | $\sum xy$ = 27,264 |
• $SS_x = \sum x^2 - (\sum x)^2 / n = 94{,}456 - (862)^2/8 = 1575.5$
• $SS_y = \sum y^2 - (\sum y)^2 / n = 8{,}211 - (255)^2/8 = 82.875$
• $SS_{xy} = \sum xy - (\sum x \sum y) / n = 27{,}264 - (862)(255)/8 = -212.25$
• $b_1 = SS_{xy} / SS_x = -212.25 / 1575.5 = -.1347$
• $b_0 = \bar{y} - b_1 \bar{x} = \sum y / n - (-.1347)(\sum x / n) = (255/8) + (.1347)(862/8) = 46.3889$
• Therefore, $y = 46.3889 - .1347x$
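The worksheet totals and the resulting coefficients can be reproduced with a short script. This is a sketch for checking the arithmetic; the variable names are illustrative. Because the script carries the unrounded slope into the intercept, its $b_0$ comes out near 46.391 rather than 46.3889, which uses $b_1$ rounded to four decimal places.

```python
# Engine displacement (x, cdi) and mileage (y, mpg) for the eight cars.
cdi = [121, 120, 97, 98, 122, 97, 85, 122]
mpg = [30, 31, 34, 27, 29, 34, 38, 32]
n = len(cdi)

# Column totals of the worksheet.
sum_x = sum(cdi)
sum_y = sum(mpg)
sum_x2 = sum(x * x for x in cdi)
sum_y2 = sum(y * y for y in mpg)
sum_xy = sum(x * y for x, y in zip(cdi, mpg))

# Sums of squares and cross products.
ss_x = sum_x2 - sum_x**2 / n
ss_y = sum_y2 - sum_y**2 / n
ss_xy = sum_xy - sum_x * sum_y / n

# Least squares estimates.
b1 = ss_xy / ss_x
b0 = sum_y / n - b1 * (sum_x / n)

print(f"SSx = {ss_x}, SSy = {ss_y}, SSxy = {ss_xy}")
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```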
Analysis
• With $y = 46.3889 - .1347x$, we can now estimate the mileage rating of a car model given its engine displacement.
• However, the estimate is reliable only if there is a close correlation between the two variables considered. Correlation analysis must establish this before the model can be used to predict the values of a variable. A commonly used measure of correlation is Spearman's rank correlation coefficient, denoted by $r_s$.
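As an illustration of the rank correlation mentioned above, the sketch below computes $r_s$ for the car data. It assumes the usual treatment of ties (tied values share the average of their ranks) and computes $r_s$ as the ordinary correlation coefficient of the ranks; neither detail is specified in the slides, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def correlation(a, b):
    """Ordinary (product-moment) correlation coefficient of two lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return cross / sqrt(spread_a * spread_b)


cdi = [121, 120, 97, 98, 122, 97, 85, 122]
mpg = [30, 31, 34, 27, 29, 34, 38, 32]

r_s = correlation(average_ranks(cdi), average_ranks(mpg))
print(f"Spearman r_s = {r_s:.3f}")
```

A value near –1 or +1 indicates a strong monotonic relationship, while a value near 0 indicates a weak one.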
Specification Criteria
1. Theory. Is there a theoretically sound justification for including the variable?
2. t-test. Is the variable's estimated coefficient statistically significant and of the expected sign?
3. Does the overall fit of the equation improve when the variable is added?
4. Bias. Does another variable's coefficient change significantly when the variable is added to the equation?
PROBLEMS
• Contemporaneous correlation between the independent variables and the disturbance term
• Heteroscedasticity
• Autocorrelation
Further Analysis
• Moreover, the appropriateness of the regression model must be tested further to show that the model is good enough for predictive purposes. The F statistic is the tool used to test its appropriateness.
• In other words, as this particular model illustrates, there is more to statistics than has been presented here. If you are interested in pursuing the subject, you are encouraged to acquire books that cover it in more depth or to study the subject formally.
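For the car mileage model, the F statistic mentioned above could be computed along the following lines. This sketch assumes the standard F test for simple linear regression, in which the regression sum of squares (1 degree of freedom) is compared with the error sum of squares (n – 2 degrees of freedom); the slides do not give this formula, so treat it as an assumption.

```python
cdi = [121, 120, 97, 98, 122, 97, 85, 122]
mpg = [30, 31, 34, 27, 29, 34, 38, 32]
n = len(cdi)

# Sums of squares and cross products, as in the estimation formulas.
ss_x = sum(x * x for x in cdi) - sum(cdi) ** 2 / n
ss_y = sum(y * y for y in mpg) - sum(mpg) ** 2 / n
ss_xy = sum(x * y for x, y in zip(cdi, mpg)) - sum(cdi) * sum(mpg) / n

# Split the total variation in y into explained and unexplained parts.
ss_regression = ss_xy**2 / ss_x     # variation explained by the line
ss_error = ss_y - ss_regression     # variation left unexplained

# F = mean square regression / mean square error.
f_statistic = (ss_regression / 1) / (ss_error / (n - 2))
print(f"F = {f_statistic:.4f} with 1 and {n - 2} degrees of freedom")
```

The result is then compared with the critical value from an F table at the chosen significance level.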
| Problem objective | Nominal scale | Ordinal scale | Interval scale |
|---|---|---|---|
| Description of a single population | z test and estimator of $p$; $\chi^2$ test of the multinomial experiment | | z test and estimator of $\mu$; t test and estimator of $\mu$; $\chi^2$ test and estimator of variance |
| Comparison of two populations | z test and estimator of $(p_1 - p_2)$ | Wilcoxon rank sum test for independent samples and for matched pairs | z test and estimator of $\mu_1 - \mu_2$; t test and estimator of $\mu_p = \mu_1 - \mu_2$; F test and estimator of the ratio of two variances, $\delta_1^2 / \delta_2^2$ |
| Comparison of two or more populations | $\chi^2$ test of a contingency table | Kruskal-Wallis test for the completely randomized design; Friedman test for the randomized block design | ANOVA: completely randomized design; ANOVA: randomized block design |
| Analysis of the relationship between two variables | | Spearman rank correlation | Simple linear regression and correlation |
| Analysis of the relationship among two or more variables | | | Multiple regression analysis |