Lecture Stat GS
Transcript of Lecture Stat GS
-
8/3/2019 Lecture Stat GS
1/97
Introduction
"STATISTICS is the most important science in the whole world; for upon it depends the practical application of every science and of every art; the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience."
- Florence Nightingale
Basic Concepts
The term statistics originated from the Latin word status, which means state.
The original definition was the science dealing with data about the condition of a state or community.
Statistics is the branch of science that deals with the collection, presentation, organization, analysis, and interpretation of data.
Basic Concepts
The population is the collection of all elements under consideration in a statistical inquiry.
The sample is a subset of a population.
The variable is a characteristic or attribute of the elements in a collection that can assume different values for the different elements.
An observation is a realized value of a variable.
Data is the collection of observations.
Basic Concepts
Example
Variable                                     Possible Observations
S = sex of a student                         Male, Female
E = employment status of an employee         Permanent, Temporary, Contractual
I = monthly income of a person (in pesos)    greater than or equal to zero
N = number of children of a teacher          n = 0, 1, 2, 3, ...
H = height of a basketball player            h > 0 (in cm or in.)
Basic Concepts
Example
The Office of Admissions is studying the relationship between the score in the entrance examination during application and the general weighted average (GWA) upon graduation among graduates of the university from 2005 to 2010.
Population: collection of all graduates of the university from the years 2005 to 2010.
Variables of interest: score in the entrance examination and GWA
Basic Concepts
The Department of Health is interested in determining the percentage of children below 12 years old infected by the Hepatitis B virus in Laguna in 2010.
Population: set of all children below 12 years old in Laguna in 2010.
Variable of interest: whether or not the child has ever been infected by the Hepatitis B virus.
Basic Concepts
The parameter is a summary measure describing a specific characteristic of the population.
The statistic is a summary measure describing a specific characteristic of the sample.
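To illustrate the difference between a parameter and a statistic, here is a minimal sketch. The allowance figures, the sample size of 4, and the seed are hypothetical values chosen for illustration only.

```python
import random
from statistics import mean

# Hypothetical population: monthly allowances (in pesos) of 10 students.
population = [1200, 1500, 900, 2000, 1800, 1100, 1600, 1300, 1700, 1400]

# Parameter: a summary measure computed from ALL elements of the population.
population_mean = mean(population)

# Statistic: the same summary measure computed from a sample only.
rng = random.Random(2019)            # fixed seed so the run is repeatable
sample = rng.sample(population, 4)   # a subset of 4 elements of the population
sample_mean = mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The parameter is a single fixed number for this population, while the statistic depends on which elements happen to be in the sample.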
Fields of Statistics
Two Major Fields of Statistics
1. Applied Statistics
is concerned with the procedures and techniques used in the collection, presentation, organization, analysis, and interpretation of data.
We study applied statistics in order to learn how to select and properly implement the most appropriate statistical methods that will provide answers to our research problem.
2. Theoretical or Mathematical Statistics
is concerned with the development of the mathematical foundations of the methods used in applied statistics.
We study mathematical statistics in order to understand the rationale behind the statistical methods we use in analysis and to establish new theories that will validate the use of new statistical methods or modifications of existing statistical methods in solving problems that are more complex.
Two Major Areas of Interest in
Applied Statistics
1. Descriptive Statistics
Includes all the techniques used in organizing, summarizing, and presenting the data on hand.
The data on hand may have come from all the elements of the population so that the analysis using descriptive statistics will allow us to describe the population.
The data on hand may also come from the elements of a selected sample. In this case, the analysis using descriptive statistics will only allow us to describe the sample. The methods used in descriptive statistics will not allow us to generalize about the population using the sample data.
Two Major Areas of Interest in
Applied Statistics
In descriptive statistics, we use tables and charts, and compute summary measures like averages, proportions, and percentages.
2. Inferential Statistics
We do not simply describe the sample data. Rather, we use the sample data to form conclusions about the population. Since the sample is only a subset of the population, we arrive at the conclusions about the population using inferential statistics under conditions of uncertainty.
Two Major Areas of Interest in
Applied Statistics
Example
1. A badminton player wants to know his average score for the past 10 games.
descriptive
2. Joseph wants to determine the variability of his seven exam scores in Statistics.
descriptive
Two Major Areas of Interest in
Applied Statistics
3. Based on last year's electricity bills, Mrs. Mercado would like to forecast the average monthly electricity bill she will pay for the next year based on her average monthly bill in the past year.
inferential
4. Efren Bata wants to estimate his chance of winning in the next World Championship game in Billiards based on his average scores last championship and the averages of the competing players.
inferential
5. Dr. Escape wants to determine the proportion spent on transportation during the past four months using the daily records of expenditure that she keeps.
descriptive
6. A politician wants to determine the total number of votes his rival obtained in the sample used in the exit poll.
descriptive
Steps in a Statistical Inquiry
Statistical inquiry is a designed research that provides information needed to solve a research problem.
1. Describe the characteristic of the elements in the population under study through the computation or estimation of a parameter such as the proportion, total, and average.
2. Compare the characteristics of the elements in the different subgroups in the population through contrasts of their respective summary measures.
3. Justify an assertion made by the researcher about a particular characteristic of the population or subgroups in the population.
4. Determine the nature and strength of relationships among the different variables of interest.
5. Identify the different groups of interrelated variables under study.
Steps in a Statistical Inquiry
6. Reveal the natural groupings of the elements in the population based on the values of a set of variables.
7. Determine the effects of one or more variables on a response variable.
8. Clarify patterns and trends in the values of a variable over time or space.
9. Predict the value of a variable based upon its relationship with another variable.
10. Forecast future values of a variable using a sequence of observations on the same variable taken over time.
Basic Steps in Performing a Statistical
Inquiry
1. Identify the problem.
The researchers need to define and state the problem in a clear manner so that they can arrive at appropriate solutions and recommendations later on.
2. Plan the study.
Some statistical inquiries do not reach completion or do not succeed in arriving at useful information for sound decision making because of the researchers' failure to plan the study carefully.
3. Collect the data.
The investigators take extra measures to ensure the quality of the data collected. If the collected data were incomplete, outdated, inaccurate, or worse yet, fabricated, then it will be useless to proceed with data analysis.
There are different ways of collecting data. These are through surveys, observation, experiments, and use of available documented data.
Basic Steps in Performing a Statistical
Inquiry
4. Explore the data.
Prior to data analysis, the investigators need to explore and understand the essential features of their data. This process allows them to determine if their data satisfy the assumptions made in the derivation of the statistical technique that they will use for analysis.
5. Analyze data and interpret the results.
After collecting and organizing data, analysis follows. The investigators once more carry out the plans specified in the research design but this time on data analysis. They then examine all the results on tables, charts, estimated summary measures, and tests of hypotheses. They need to check that they were able to meet all of the specific objectives. Based on the analysis carried out, the investigators must be able to answer the research problem and give recommendations on how this can be useful in decision making.
The investigators must double-check the results that contradict existing theory or the earlier hypothesis made. They may have committed errors in data collection or analysis. If not, they would have to propose possible explanations for these results or suggest future statistical inquiries that could help explain the inconsistency.
Basic Steps in Performing a Statistical
Inquiry
6. Present the results.
After analyzing the data and interpreting the results, the investigators must present these results in a clear and concise manner to the users of the research.
Measurement
Process of determining the value or label of the variable based on what has been observed.
Ratio level of measurement has all of the following properties:
A) the numbers in the system are used to classify a person/object into distinct, nonoverlapping, and exhaustive categories;
B) the system arranges the categories according to magnitude;
C) the system has a fixed unit of measurement representing a set size throughout the scale; and
D) the system has an absolute zero.
Examples:
1. allowance of a student (in pesos)
2. distance travelled by an airplane (in km)
3. the speed of a car (in km/hr)
4. height of an adult (in cm)
5. weight of a newborn baby (in kg)
Interval level of measurement satisfies only the first three properties of the ratio level.
The only difference between the interval and the ratio levels is the interpretation of the value 0 in their scales. The zero point in the interval level is not an absolute zero. Unlike in the ratio scale, the zero value in the interval scale has an arbitrary interpretation and does not mean the absence of the property we are measuring.
Examples: Temperature readings measured in degrees Centigrade (°C), Intelligence Quotient (IQ)
If the temperature is 0 °C, do we say that there is no temperature? Of course not, since 0 °C is not an absolute zero.
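One practical consequence of an arbitrary zero is that ratios of interval-level values change when the unit changes, while ratios of ratio-level values do not. The sketch below uses hypothetical readings (10 and 20 degrees, 10 and 20 km) purely for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval level: temperature in degrees Celsius (zero is arbitrary).
t1_c, t2_c = 10.0, 20.0
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)
ratio_c = t2_c / t1_c    # ratio computed in Celsius
ratio_f = t2_f / t1_f    # ratio of the same two readings in Fahrenheit

# Ratio level: distance (zero means no distance at all).
d1_km, d2_km = 10.0, 20.0
ratio_km = d2_km / d1_km
ratio_m = (d2_km * 1000) / (d1_km * 1000)   # same distances in metres

print("temperature ratio in C:", ratio_c, "| in F:", ratio_f)
print("distance ratio in km:", ratio_km, "| in m:", ratio_m)
```

The temperature ratio differs between the two scales, so a statement like "twice as hot" is not meaningful at the interval level; the distance ratio is the same in either unit.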
Ordinal level of measurement satisfies only the first two properties of the ratio level.
Examples:
1. Size of shirt (small, medium, large, extra large)
2. Performance rating of a salesperson measured as follows: Excellent, very good, good, satisfactory, poor
3. Faculty rank: Professor, Associate Professor, Assistant Professor, Instructor
Nominal level of measurement satisfies only the first property of the ratio level.
Examples:
1. Gender
2. Civil Status
3. Type of movies (Action, Romance, Comedy, Others)
4. Major island group (Luzon, Visayas, Mindanao)
Data Collection Methods
1. Use of available documented data in published or unpublished studies.
2. Surveys
3. Experiments
4. Observations
Collection of Data
Primary Data
are data documented by the primary source. The data collectors themselves documented this data.
Secondary Data
are data documented by a secondary source. An individual/agency, other than the data collectors, documented this data.
Collection of Data
Survey
is a method of collecting data on the variable of interest by asking people questions. When the data come from asking all the people in the population, then the study is called a census. On the other hand, when the data come from asking a sample of people selected from a well-defined population, then the study is called a sample survey.
Collection of Data
Experiment
is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.
Collection of Data
Observation method
is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.
Collection of Data
Examples:
1) A local TV network asked voters to indicate whom they voted for as they exited the polling booth.
Survey
2) A private hospital divides terminally ill patients into two groups, with one group receiving medication A and the other group receiving medication B. After a month, they measured each subject's improvement.
Experiment
3) A researcher investigates the level of pollution in key points in Metro Manila by setting up pollution measuring devices at selected intersections.
Observation
Collection of Data
Questionnaire
is a measurement instrument used in various data collection methods, particularly surveys. We use a questionnaire to determine and record the measurements of characteristics of the elements in a study such as height, weight, color, size, attitude, past and present behavior, and opinions.
Two Types of Questionnaire Used in Surveys
Self-Administered Questionnaire
Interview Schedule
Collection of Data
Types of Questions
1. Closed-ended question
is a type of question that includes a list of response
categories from which the respondent will select his
answer.
2. Open-ended question
is a type of question that does not include response
categories.
Collection of Data
Advantages of Open-ended Questions
1. Respondent can freely answer
2. Can elicit feelings and emotions of the respondent
3. Can reveal new ideas and views that the researcher might not have considered
4. Good for complex issues
5. Good for questions whose possible responses are unknown
6. Allows respondent to clarify answers
7. Gets detailed answers
8. Shows how respondents think
Advantages of Closed-ended Questions
1. Facilitates tabulation of responses
2. Easy to code and analyze
3. Saves time and money
4. High response rate since it is simple and quick to answer
5. Response categories make questions easy to understand
6. Can repeat the study and easily make comparisons
Collection of Data
Disadvantages of Open-ended Questions
1. Difficult to tabulate and code
2. High refusal rate because it requires more time and effort on the respondent
3. Respondent needs to be articulate
4. Responses can be inappropriate or vague
5. May threaten respondent
6. Responses have different levels of detail
Disadvantages of Closed-ended Questions
1. Increases respondent burden when there are too many or too limited response categories
2. Bias responses against categories excluded in the list of choices
3. Difficult to detect if respondent misinterpreted the question
Collection of Data
Pitfalls To Avoid in Wording Questions
1. Avoid Vague Questions
2. Avoid Biased Questions
3. Avoid Confidential and Sensitive Questions
4. Avoid Questions that are Difficult to Answer
5. Avoid Questions that are Confusing or Perplexing to Answer
6. Keep the question short and simple
Sampling and Sampling Techniques
Advantages of Sampling
1. Sampling is more economical.
2. A study based on a sample requires less time to
accomplish.
3. Sampling allows for a wider scope for the study.
4. Results of studies based on a sample can even be more accurate.
5. Sampling is sometimes the only feasible method.
Sampling and Sampling Techniques
Target Population is the population we want to study.
Sampled Population is the population from where we actually select the sample.
Elementary unit or element is a member of the population whose measurement on the variable of interest is what we wish to examine.
Sampling unit is a unit of the population that we select in our sample.
Sampling and Sampling Techniques
Target population
set of all establishments in the manufacturing, mining, and agricultural industries
Elementary unit
an establishment (which is an economic unit that engages, under a single ownership or control, in one or predominantly one kind of economic activity at a fixed single physical location) in the manufacturing, mining, and agricultural industries.
Sampling unit
an enterprise (which is an economic unit with one or more establishments under a single ownership or control) in the manufacturing, mining, and agricultural industries.
Sampling and Sampling Techniques
Sampling Frame or Frame
is a list or map showing all the sampling units in the population.
Example
Suppose a researcher is interested in getting the opinion of eligible voters on the media campaign of candidates running for top position in the government.
Target population: set of all eligible voters
Sampling Frame: Commission on Elections (COMELEC) list of registered voters.
Sampled population: set of registered voters in the list of COMELEC
This sampled population excludes the eligible voters who did not register or did not revalidate their registration with the COMELEC during the registration period.
Sampling and Sampling Techniques
Sampling Error
is the error attributed to the variation present among the computed values of the statistic from the different possible samples consisting of n elements.
Nonsampling Error
is the error from the other sources apart
from sampling fluctuations.
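To make the idea of sampling error concrete, the sketch below lists every possible sample of n = 2 elements from a small hypothetical population of N = 5 values and computes the sample mean for each one. The population values are made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of N = 5 values.
population = [2, 4, 6, 8, 10]
n = 2

population_mean = mean(population)   # the parameter

# Every possible sample of n distinct elements, with its sample mean.
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, "-> sample mean", m, "| difference from parameter", m - population_mean)
```

The sample means vary from one possible sample to another even though the measurements themselves contain no mistakes. That variation among the possible values of the statistic is what the definition above attributes to sampling.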
Sampling and Sampling Techniques
Sampling error occurs when we collect data from a sample and not from all the elements in the population. It is an error innate in results based on a sample.
Classifications of Nonsampling Error
1. Measurement error
2. Error in the implementation of the sampling design
Sampling and Sampling Techniques
Measurement error
is the difference between the true value of the variable and the observed value used in the study.
This occurs when we are using a faulty measurement instrument or when we do not use the instrument properly.
Error in the implementation of the sampling design
occurs when we do not adhere to the procedures and requirements as specified in the sampling design.
Sampling and Sampling Techniques
Total Error
[Diagram: total error divides into sampling error and nonsampling error. Nonsampling error divides into error in the implementation of the sampling design and measurement error. Other labels shown in the diagram: instrument error, selection error, response error, response bias, nonresponse bias, frame error, processing error, population specification error, interviewer bias, surrogate information error.]
Sampling and Sampling Techniques
Measurement Errors
1. Interviewer bias may occur when an enumerator reacts to a respondent's reply.
2. Errors in editing and coding
3. Bias occurs when a respondent tends to respond to items in an acceptable manner instead of truthfully.
4. Errors in conversion from one unit of measurement to another
5. Response set occurs when a respondent agrees with all the statements without careful consideration of each one of the given statements.
6. Faulty measurement devices such as a weighing scale that is not properly calibrated
Sampling and Sampling Techniques
Errors in the Implementation of the Sampling Design
1. Sampling frame defines a sampled population that is too far from the target population
2. Sampling frame is outdated
3. Complicated sample selection procedure is done in the field by confused enumerators who incorrectly select the respondents included in the sample
4. Lazy enumerators do not follow the specified sample selection procedure
5. Target population is the target consumers of a particular brand but researchers incorrectly define the qualifications of the target consumers.
Sampling and Sampling Techniques
Probability Sampling
is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling.
Sampling and Sampling Techniques
Probability Sampling Methods
1. Simple Random Sampling
2. Stratified Sampling
3. Systematic Sampling
4. Cluster Sampling
5. Multistage Sampling
Probability Sampling Methods
Simple Random Sampling is a probability sampling method wherein all possible subsets consisting of n elements selected from the N elements of the population have the same chances of selection.
In simple random sampling without replacement (SRSWOR), all the n elements in the sample must be distinct from each other.
In simple random sampling with replacement (SRSWR), the n elements in the sample need not be distinct, that is, an element can be selected more than once to be a part of the sample.
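The two variants can be sketched with Python's standard random module. The frame of N = 20 numbered elements, the sample size n = 5, and the seed are assumptions made for illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable

N = 20
frame = list(range(1, N + 1))    # elements numbered 1 to N
n = 5

# SRSWOR: random.sample never repeats an element.
srswor = rng.sample(frame, n)

# SRSWR: random.choices draws with replacement, so repeats are possible.
srswr = rng.choices(frame, k=n)

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```

In the SRSWOR draw all n numbers are distinct. In the SRSWR draw the same number may appear more than once, although it does not have to in any particular run.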
Probability Sampling Methods
Systematic sampling is a probability sampling method wherein the selection of the first element is at random and the selection of the other elements in the sample is systematic by subsequently taking every kth element from the random start, where k is the sampling interval.
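A minimal sketch of systematic selection follows. It assumes the elements are numbered 1 to N, that N is an exact multiple of n, and that the sampling interval is taken as k = N / n; the function name and the values N = 100 and n = 10 are hypothetical.

```python
import random

def systematic_sample(N, n, rng):
    """Return the numbers of the n selected elements out of 1..N.

    Assumes N is an exact multiple of n, so the interval k = N // n
    covers the whole list evenly.
    """
    k = N // n                    # sampling interval
    start = rng.randint(1, k)     # random start between 1 and k, inclusive
    return [start + i * k for i in range(n)]

rng = random.Random(2019)         # fixed seed so the run is repeatable
selected = systematic_sample(N=100, n=10, rng=rng)
print(selected)
```

Only the first element is chosen at random; every other element is determined by adding the interval k repeatedly.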
Probability Sampling Methods
Cluster sampling is a probability sampling
method wherein we divide the population into
nonoverlapping groups or clusters consisting
of one or more elements, and then select a
sample of clusters. The sample will consist of
all the elements in the selected clusters.
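As a rough illustration, the sketch below selects clusters by simple random sampling and then takes every element in the selected clusters. The cluster names, their members, and the number of clusters selected are hypothetical.

```python
import random

# Hypothetical population divided into nonoverlapping clusters.
clusters = {
    "Block A": ["A1", "A2", "A3"],
    "Block B": ["B1", "B2"],
    "Block C": ["C1", "C2", "C3", "C4"],
    "Block D": ["D1"],
}

rng = random.Random(7)                        # fixed seed so the run is repeatable

# Select a sample of clusters (here 2 of the 4), not of elements.
selected_clusters = rng.sample(sorted(clusters), 2)

# The sample consists of ALL elements in the selected clusters.
sample = [element for name in selected_clusters for element in clusters[name]]

print("selected clusters:", selected_clusters)
print("sample:", sample)
```

Note that the number of elements in the sample is not fixed in advance; it depends on the sizes of the clusters that happen to be selected.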
Probability Sampling Methods
Multistage sampling is a probability sampling
method where there is a hierarchical
configuration of sampling units and we select
a sample of these units in stages.
Basic Methods of Probability Sampling
1. Simple Random Sampling
Procedure:
List the elements and number them from 1 to N.
Select n numbers from 1 to N, using a randomization mechanism.
The sample will consist of the elements corresponding to the numbers selected.
Advantages:
Design is simple and easy to understand.
Estimation methods are simple and easy.
Disadvantages:
It needs a list of all elements in the population.
Sample size must be very large for heterogeneous populations in order to get reliable results.
High transportation cost if elements are widely spread geographically.
When to use:
If the elements are homogeneous with respect to the characteristic under study.
If the elements are not so spread out geographically.
Basic Methods of Probability Sampling
2. Stratified Random Sampling
Procedure:
Divide the population into nonoverlapping strata.
Obtain a simple random sample from each stratum.
The sample consists of the selected samples in all the strata.
Advantages:
Estimates are more reliable compared to SRS of the same sample size if the population has been divided into strata with homogeneous elements, but the strata are very different from each other.
Estimation of the parameter for each subpopulation is easier when compared to other sampling methods.
It can facilitate the administration and supervision of data collection, especially if the stratification variable is a geographic subdivision.
Disadvantages:
It needs a list of all elements of the population, including their values of the stratification variable.
High transportation cost if elements are widely spread geographically, unless there are field offices in each geographic area.
When to use:
If the population is heterogeneous with respect to the characteristic under study.
If we want to perform separate analysis for certain subpopulations.
If we wish to facilitate the administration of the collection of data.
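The stratified procedure can be sketched as follows. The strata, the element labels, and the number of elements taken from each stratum are hypothetical; how to allocate the sample among the strata is not discussed here, so the allocation below is simply assumed.

```python
import random

# Hypothetical population divided into nonoverlapping strata.
strata = {
    "Luzon": [f"L{i}" for i in range(1, 51)],      # 50 elements
    "Visayas": [f"V{i}" for i in range(1, 31)],    # 30 elements
    "Mindanao": [f"M{i}" for i in range(1, 21)],   # 20 elements
}

# Assumed number of elements to select from each stratum.
allocation = {"Luzon": 5, "Visayas": 3, "Mindanao": 2}

rng = random.Random(11)   # fixed seed so the run is repeatable

# Obtain a simple random sample (without replacement) from each stratum.
stratum_samples = {
    name: rng.sample(elements, allocation[name])
    for name, elements in strata.items()
}

# The overall sample consists of the selected samples in all the strata.
sample = [e for chosen in stratum_samples.values() for e in chosen]

print(stratum_samples)
```

Unlike cluster sampling, every stratum contributes elements to the sample, which is what allows separate analysis for each subpopulation.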
Basic Methods of Probability Sampling
3. Systematic Sampling
Procedure:
Assign a unique number from 1 to N to each element of the population.
Determine the sampling interval, k.
Obtain the first element in the sample using a randomization mechanism.
Get the rest of the elements in the sample by taking every kth element from the random start.
Advantages:
Identifying the units in the sample is easy.
The design does not require a list of all elements in the population.
The sample is distributed evenly over the entire population.
It gives more reliable estimates than simple random sampling when the arrangement of the elements in the sampling frame is according to magnitude.
Disadvantages:
Estimates may not be reliable when there are periodic regularities in the list.
It requires information on the arrangement of the elements in the sampling frame to determine the reliability of the estimates.
When to use:
If there is no available list of elements in the population.
If the arrangement of the elements in the sampling frame is according to magnitude.
Basic Methods of Probability Sampling
4. Cluster Sampling
Procedure:
Divide the population into nonoverlapping clusters.
Select a sample of clusters using simple random sampling.
The sample consists of all the elements in the selected clusters.
Advantages:
The design needs only a list of clusters and not a list of elements.
Transportation and listing costs are usually lower.
Disadvantages:
Estimates are usually less reliable when compared to other sampling designs.
It is not cost-efficient if the clusters are large and the elements are homogeneous with respect to the characteristic under study.
When to use:
If there is no available list of elements.
If cost is more important than reliability of the estimates.
5. Multistage Sampling
Procedure:
Select sample in several stages.
Advantages:
Reduced transportation and listing cost.
Disadvantages:
Difficult estimation procedures.
The design needs thorough planning before performing sample selection.
When to use:
If the geographic coverage of the population of interest is wide.
If no listing of the elementary units in the population is available.
Methods of Nonprobability Sampling
Nonprobability sampling methods do not make use of any randomization mechanism in identifying the sampling units included in the sample. Rather, they allow the researcher to choose the units in the sample subjectively. Since the selection of the sample is subjective, there is consequently no objective way of assessing the reliability of the results without making assumptions that are oftentimes difficult to verify.
Methods of Nonprobability Sampling
Haphazard or Convenience Sampling
The sample consists of elements that are most accessible or easiest to contact. This usually includes friends, acquaintances, volunteers, and subjects who are available and willing to participate at the time of the study.
Example: The adviser of a student organization is conducting a research on study habits of students in the university. To select a sample, the adviser includes the members of the student organization because it is easy to reach them and get data from them.
Methods of Nonprobability Sampling
Judgement or Purposive Sampling
The researcher chooses a sample that agrees with his/her subjective judgement of a representative sample.
Methods of Nonprobability Sampling
Quota Sampling
is the nonprobability sampling version of stratified sampling. In quota sampling, the researcher also chooses the grouping or strata in the study but the selection of the sampling units within the stratum does not make use of a probability sampling method. The researcher just sets a quota or number of sampling units to be included in each grouping but uses convenience sampling to select the units within each grouping.
Sampling and Sampling Techniques
Sample Size Determination
An important component of the sampling design is the sample size. The number of elements that you include in the sample must not be too small because this will not allow you to come up with reliable estimates.
In determining the sample size, you should always consider the reliability of the results of the study and, at the same time, the cost involved in doing the study.
Presentation of Data
Textual Presentation
Textual presentation of data incorporates important figures in a paragraph of text.
Tabular Presentation
Tabular presentation of data arranges figures in a systematic manner in rows and columns.
Graphical Presentation
Graphical presentation of data portrays numerical figures or relationships among variables in pictorial form.
Organization of Data
Raw Data
are data in their original form.
Array
is an ordered arrangement of data according to magnitude. We also refer to the array as sorted data or ordered data.
Organization of Data
Frequency Distribution
Frequency distribution is a way of summarizing data by showing the number of observations that belong in the different categories or classes. We also refer to this as grouped data.
Frequency Distribution
Final Grade   No. of Students
40-49         8
50-59         23
60-69         42
70-79         62
80-89         58
90-99         17
Total         210

Final Grade   No. of Students
40-46         7
47-53         9
54-60         18
61-67         30
68-74         41
75-81         48
82-88         39
89-95         13
96-102        5
Total         210

Final Grade   No. of Students
40-45         7
46-51         6
52-57         10
58-63         24
64-69         26
70-75         35
76-81         45
82-87         34
88-93         13
94-99         10
Total         210
Frequency Distribution
Class Interval is the range of values that belong in the class or category.
Class Frequency is the number of observations that belong in a class interval.
Class Limits are the end numbers used to define the class interval. The lower class limit (LCL) is the lower end number while the upper class limit (UCL) is the upper end number.
Class Boundaries are the true limits. If the observations are rounded figures, then we identify the class boundaries based on the standard rules of rounding as follows: the lower class boundary (LCB) is halfway between the lower class limit of the class and the upper class limit of the preceding class while the upper class boundary (UCB) is halfway between the upper class limit of the class and the lower class limit of the next class.
Frequency Distribution
Class Size is the size of the class interval. It is the difference between the upper class boundaries of the class and the preceding class; or the difference between the lower class boundaries of the next class and the class. We can also use the class limits in place of the class boundaries.
Class Mark is the midpoint of a class interval. It is the average of the lower class limit and the upper class limit.
Frequency Distribution
Final Grade   No. of Students
40-49         8
50-59         23
60-69         42
70-79         62
80-89         58
90-99         17
Total         210

For the first class interval, 40-49:
Class frequency: 8
Class limits: 40 is the lower class limit while 49 is the upper class limit
Class boundaries: 39.5 (lower) and 49.5 (upper)
Class size: 50 - 40 = 10
Class mark or midpoint: (40 + 49)/2 = 44.5
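The same quantities can be computed for every class in the table. This sketch assumes the grades are rounded to the nearest whole number, so each class boundary lies 0.5 below the lower class limit and 0.5 above the upper class limit; the variable names are illustrative.

```python
# Class limits (LCL, UCL) taken from the table of final grades.
class_limits = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]

rows = []
for lcl, ucl in class_limits:
    lcb = lcl - 0.5              # lower class boundary
    ucb = ucl + 0.5              # upper class boundary
    rows.append({
        "limits": (lcl, ucl),
        "boundaries": (lcb, ucb),
        "size": ucb - lcb,       # class size from the boundaries
        "mark": (lcl + ucl) / 2, # class mark (midpoint)
    })

for row in rows:
    print(row)
```

For the first class this reproduces the boundaries 39.5 and 49.5, a class size of 10, and a class mark of 44.5.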
Frequency Distribution
Steps in the Construction of a Frequency Distribution
1. Determine the adequate number of classes K. Usually between 5 and 20, or K = 1 + 3.322 log n (Sturges' rule), where log is the ordinary logarithm (base 10) and n is the number of observations.
2. Determine the range: R = highest observed value - lowest observed value.
3. Compute C' = R/K.
4. Determine the class size C by rounding off C' to a convenient number.
5. Choose the lower class limit of the first class. Usually based on the lowest observed value.
6. Tally all the observed values in each class interval.
7. Sum the frequency column and check against the total number of observations.
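The steps above can be sketched as a short function. The 20 scores are hypothetical, the data are assumed to be whole numbers, and "rounding off to a convenient number" is interpreted here as rounding up to the next whole number, which is only one possible choice.

```python
import math

def frequency_distribution(data):
    """Build (LCL, UCL, frequency) rows for whole-number data."""
    n = len(data)
    k = round(1 + 3.322 * math.log10(n))   # step 1: Sturges' rule
    r = max(data) - min(data)              # step 2: range
    c_prime = r / k                        # step 3
    c = max(1, math.ceil(c_prime))         # step 4: assumed rounding rule
    lcl = min(data)                        # step 5: start at the lowest value
    table = []
    while lcl <= max(data):
        ucl = lcl + c - 1                  # whole-number data assumed
        freq = sum(lcl <= x <= ucl for x in data)   # step 6: tally
        table.append((lcl, ucl, freq))
        lcl += c
    assert sum(f for _, _, f in table) == n         # step 7: check the total
    return table

# Hypothetical exam scores of 20 students.
scores = [52, 67, 71, 45, 88, 93, 60, 75, 79, 81,
          66, 58, 73, 84, 90, 49, 77, 69, 72, 86]

for lcl, ucl, freq in frequency_distribution(scores):
    print(f"{lcl}-{ucl}: {freq}")
```

For these scores Sturges' rule suggests 5 classes, the range is 48, and the class size becomes 10, giving the classes 45-54, 55-64, 65-74, 75-84, and 85-94.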
Frequency Distribution
Cumulative Frequency Distribution (CFD) is another variation of the frequency distribution. We use this to determine how many observations have values smaller than or greater than a specified class boundary. It shows the accumulated frequencies of successive classes, either at the beginning or at the end of the distribution.
Less Than Cumulative Frequency Distribution shows the number of observations with values smaller than or equal to the upper class boundary.
Greater Than Cumulative Frequency Distribution shows the number of observations with values larger than or equal to the lower class boundary.
Frequency Distribution
Final Grade    No. of Students    Less Than CFD    Greater Than CFD
40-49          8                  8                210
50-59          23                 31               202
60-69          42                 73               179
70-79          62                 135              137
80-89          58                 193              75
90-99          17                 210              17
Total          210
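Both cumulative columns can be reproduced from the class frequencies with running sums. This is a minimal sketch; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies from the final-grade table, lowest class first.
frequencies = [8, 23, 42, 62, 58, 17]

# Less-than CFD: accumulate from the first class downward.
less_than_cfd = list(accumulate(frequencies))

# Greater-than CFD: accumulate from the last class upward, then restore order.
greater_than_cfd = list(accumulate(reversed(frequencies)))[::-1]

print(less_than_cfd)
print(greater_than_cfd)
```

The last less-than value and the first greater-than value both equal the total number of observations, which is a quick check on the table.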
Graphical Presentation of the Frequency Distribution
1. Frequency Histogram
2. Frequency Polygon
3. Less Than Ogive
4. Greater Than Ogive
5. Pie Chart
Measures of Central Tendency
Measures of Central Tendency, or "location", attempt to quantify what we mean when we think of the "typical" or "average" score in a data set. The concept is extremely important and we encounter it frequently in daily life.
For example, before purchasing a car we often want to know its average distance per litre of petrol. Or before accepting a job, you might want to know what a typical salary is for people in that position so you will know whether or not you are going to be paid what you are worth. Or, if you are a smoker, you might often think about how many cigarettes you smoke "on average" per day.
Statistics geared toward measuring central tendency all focus on this concept of "typical" or "average." As we will see, we often ask questions in psychological science revolving around how groups differ from each other "on average". Answers to such a question tell us a lot about the phenomenon or process we are studying.
We also use measures of central tendency to facilitate the comparison of two or more data sets. For example, a teacher may want to answer the question, "Who performed better in the exam, the girls or the boys?" The teacher can then compare the average score of the girls with the average score of the boys. If the average score of the girls is higher, then the teacher can conclude that the girls generally performed better than the boys in the exam.
Measures of Central Tendency
The Arithmetic Mean
The arithmetic mean, or simply the mean, is the most common type of average. It is the sum of all the observed values divided by the number of observations.
μ (mu): population mean
x̄ (x-bar): sample mean
Measures of Central Tendency
The Weighted Mean
When all the individual observed values have equal importance, we compute the arithmetic mean. On the other hand, if we believe that the individual observed values vary in their degree of importance, then it is advisable to use a modification of the mean that we call the weighted mean. The Weighted Mean assigns weights to the observations depending on their relative importance.
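As a rough illustration, the weighted mean can be computed as the sum of each value times its weight, divided by the sum of the weights. This is the usual form of the formula and is assumed here, since the slide does not show it. The grades and units below are hypothetical.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical course grades weighted by the number of units per course.
grades = [1.5, 2.0, 1.25]
units = [3, 5, 2]

print(weighted_mean(grades, units))       # grades weighted by units
print(weighted_mean(grades, [1, 1, 1]))   # equal weights
print(fmean(grades))                      # ordinary arithmetic mean
```

With equal weights the weighted mean reduces to the arithmetic mean, which matches the statement that the arithmetic mean treats all observations as equally important.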
Measures of Central Tendency
The Trimmed Mean
We have noted that the mean may not be a good measure of central tendency whenever there are outliers. An outlier pulls the value of the mean in its direction and farther away from the rest of the observations. However, a modification of the arithmetic mean, called the Trimmed Mean, addresses this particular problem.
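The slide does not state how the trimming is done. A common approach, assumed in this sketch, is to sort the data, drop the same proportion of observations from each end, and average what remains. The income figures are hypothetical.

```python
from statistics import fmean


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the observations from each end.

    The trimming rule and the default of 10% are assumptions for illustration.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations to drop at each end
    kept = ordered[k:len(ordered) - k]
    return fmean(kept)


# Hypothetical monthly incomes (in thousands) with one outlier.
incomes = [12, 13, 13, 14, 15, 15, 16, 17, 18, 250]

print(fmean(incomes))               # pulled upward by the outlier
print(trimmed_mean(incomes, 0.1))   # smallest and largest value dropped
```

The ordinary mean is pulled toward the outlier of 250, while the 10% trimmed mean stays near the bulk of the observations.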
Measures of Central Tendency
The Median
The median divides an ordered set of observations into two equal parts. In other words, it is the measure occupying the positional center of the array. If an observation is smaller than the median, then it belongs in the lower half of the array; and if an observation is larger than the median, then it belongs in the upper half of the array.
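A minimal sketch of finding the median follows. The slide does not say what to do when the number of observations is even; averaging the two middle values is the usual convention and is an assumption here.

```python
def median(values):
    """Middle value of the ordered array (average of the two middle values
    when the number of observations is even, by the usual convention)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # odd number of observations
print(median([7, 1, 5, 3]))   # even number of observations
```

Python's standard library provides statistics.median, which follows the same convention.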
Measures of Central Tendency
The Mode
The mode is the most frequent observed value in the data set. It is the observed value that occurs the greatest number of times. If the data set is small, we easily see the mode through inspection. However, as the data set becomes large, finding the mode is quite tedious without a computer. Generally, the mode is a less popular measure of central tendency as compared to the mean and the median.
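Finding the mode by computer amounts to counting how often each value occurs. In this sketch the function name modes is illustrative, and treating a data set in which every distinct value occurs equally often as having no mode is one common convention, assumed here.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    # Convention (an assumption): if every distinct value occurs equally
    # often, no value stands out, so we report no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 7, 7]))        # two modes
print(modes([1, 2, 3]))                 # no mode
print(modes(["red", "blue", "red"]))    # works for nominal data too
```

The examples show the three cases a data set can fall into: more than one mode, no mode, and a single mode.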
Summary of the Different Measures of Central Tendency

Mean (center of mass)
Definition: Sum of all the values in the collection divided by the total number of elements in the collection
Data requirement: At least interval scale and values that are close to each other
Existence/Uniqueness: Always exists / Always unique
Takes into account every value? Yes
Affected by outliers? Yes
Can treat formula algebraically? Yes

Median (center of the array)
Definition: Divides the array into two equal parts
Data requirement: At least ordinal scale
Existence/Uniqueness: Always exists / Always unique
Takes into account every value? No
Affected by outliers? No
Can treat formula algebraically? No

Mode (typical value)
Definition: Most frequent value
Data requirement: Even if nominal scale only
Existence/Uniqueness: Might not exist / Not always unique
Takes into account every value? No
Affected by outliers? No
Can treat formula algebraically? No
Measures of Location
A measure of location provides us information on the percentage of observations in the collection whose values are less than or equal to it. We also commonly refer to these measures of location as Quantiles or Fractiles.
The three measures of location are Percentiles, Quartiles, and Deciles.
Measures of Location
Percentiles divide the ordered observations into
100 equal parts.
Quartiles divide the ordered observations into 4
equal parts.
Deciles divide the ordered observations into 10
equal parts.
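As an illustration, Python's statistics.quantiles returns the cut points that divide ordered data into n equal parts. The scores below are hypothetical. Textbooks and software use slightly different interpolation rules for quantiles, and the lecture does not state one, so the library's default rule is used here as an assumption.

```python
from statistics import quantiles

# Hypothetical exam scores, for illustration only.
scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90, 95]

quartiles = quantiles(scores, n=4)      # 3 cut points: Q1, Q2, Q3
deciles = quantiles(scores, n=10)       # 9 cut points: D1, ..., D9
percentiles = quantiles(scores, n=100)  # 99 cut points: P1, ..., P99

print("Quartiles:", quartiles)
print("Deciles:", deciles)
print("50th percentile:", percentiles[49])
```

Dividing the data into n equal parts always takes n - 1 cut points, and the second quartile, fifth decile, and 50th percentile all coincide with the median.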
Measures of Dispersion
A Measure of Dispersion is a descriptive summary measure that helps us characterize the data set in terms of how varied the observations are from each other. This measure allows us to determine the degree of dispersion of the observations about the center of the distribution. If its value is small, then this indicates that the observations are not too different from each other so that there is a concentration of observations about the center. On the other hand, if its value is large, then this indicates that the observations are very different from each other so that they are widely spread out from the center.
Measures of Dispersion
The Variance and Standard Deviation
The variance is a measure of dispersion, so we can use it to describe the variation of the measurements in the collection. It is defined as the average squared deviation or difference of each observation from the mean. The squared difference of an observation from the mean gives us an idea of how close this observation is to the mean. A large squared difference indicates that the observation and the mean are far from each other, while a small squared difference indicates that the observation and the mean are close to each other. In fact, when the squared difference is zero (0), this implies that the observation and the mean are equal to each other.
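The definition above translates directly into a few lines of Python. This sketch divides by the number of observations, which matches "average squared deviation" for a population. Sample variance is usually computed with n - 1 in the denominator, a distinction this slide does not cover. The data values are made up for illustration.

```python
import math
import statistics


def population_variance(values):
    """Average squared deviation of each observation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5

variance = population_variance(data)
std_dev = math.sqrt(variance)  # standard deviation is the square root

print("Variance:", variance)
print("Standard deviation:", std_dev)
print("Sample variance (n - 1):", statistics.variance(data))
```

The standard deviation is often easier to interpret than the variance because it is in the same unit as the observations.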
Measures of Dispersion
We can also use the variance to determine if the mean is a good measure of central tendency. A small variance indicates that the observations are highly concentrated about the mean so that it is appropriate to use the mean to represent all of the values in the collection. If the variance is large, on the other hand, this indicates that, on the average, the observations are far or very different from the mean. In this case, we cannot consider the mean a good measure of central tendency because it will not be a suitable representative of all values in the collection.
Measures of Dispersion
The Coefficient of Variation
The coefficient of variation (CV) is a measure of relative variation. We can use it to compare the variability of two or more data sets even if they have different means or different units of measurement, because the coefficient of variation has no unit.
The CV is the ratio of the standard deviation to the mean, expressed as a percentage.
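A minimal sketch of the CV follows, using hypothetical heights and weights to compare two data sets measured in different units. The population standard deviation is used here as an assumption, since the slide does not say which version to use.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    mean = fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mean * 100


# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_height:.1f}%")
print(f"CV of weights: {cv_weight:.1f}%")
```

Both data sets have the same standard deviation, but the weights vary more relative to their mean, which is the comparison the CV is designed to make.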
Measures of Shape: Skewness and Kurtosis
A measure of skewness is a single value that indicates the degree and direction of asymmetry.
If it is possible to divide the histogram at the center into two identical halves, wherein each half is a mirror image of the other, then the distribution is called a symmetric distribution. Otherwise, it is called a skewed distribution.
Measures of Shape: Skewness and Kurtosis
If the concentration of the values is at the left end of the distribution and the upper tail of the distribution stretches out more than the lower tail, then the distribution is said to be positively skewed or skewed to the right. Conversely, if the concentration of the values is at the right end of the distribution and the lower tail of the distribution stretches out more than the upper tail, then the distribution is said to be negatively skewed or skewed to the left.
Measures of Shape: Skewness and Kurtosis
[Figure: distribution skewed to the right]
Measures of Shape: Skewness and Kurtosis
[Figure: distribution skewed to the left]
Measures of Shape: Skewness and Kurtosis
Interpreting Skewness
If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer.
If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979), a classic, suggests this rule of thumb:
If skewness is less than -1 or greater than +1, the distribution is highly skewed.
If skewness is between -1 and -1/2 or between +1/2 and +1, the distribution is moderately skewed.
If skewness is between -1/2 and +1/2, the distribution is approximately symmetric.
Mean = Median = Mode ---- Symmetric Distribution
Mean > Median > Mode ---- Positively Skewed
Mean < Median < Mode ---- Negatively Skewed
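The rule of thumb above can be applied by computer once a skewness value is available. The lecture does not give a formula for skewness. This sketch assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), which is one common choice. The function names and data are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (an assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


def interpret_skewness(g1):
    """Apply Bulmer's rule of thumb as quoted in the lecture."""
    if g1 < -1 or g1 > 1:
        return "highly skewed"
    if g1 < -0.5 or g1 > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


symmetric = [1, 2, 3, 4, 5]      # mirror image about the mean
right_skewed = [1, 2, 3, 4, 10]  # one large value stretches the upper tail

for data in (symmetric, right_skewed):
    g1 = skewness(data)
    print(data, round(g1, 3), interpret_skewness(g1))
```

The sign of the result gives the direction of the skew, and its size is what the rule of thumb classifies.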
Measures of Shape: Skewness and Kurtosis
Karl Pearson (1857-1936), the founder of biometrics and a major contributor to the theory of modern applied statistics, coined the term kurtosis in 1905. The term came from the Greek word kurtos, meaning convex. Pearson used it to describe the shape of the hump of a relative frequency distribution as compared to the normal distribution.
Measures of Shape: Skewness and Kurtosis
Mesokurtic: the hump is the same as that of the normal curve. It is neither too flat nor too peaked. (Section C)
Leptokurtic: the curve is more peaked and the hump is narrower or sharper than that of the normal curve. The prefix "lepto" came from the Greek word leptos, meaning small or thin. (Section E)
Platykurtic: the curve is less peaked and the hump is flatter than that of the normal curve. The prefix "platy" came from the Greek word platus, meaning wide or flat. (Section D)
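The lecture does not give a formula for kurtosis. A common choice, assumed in this sketch, is the excess kurtosis: the fourth central moment divided by the square of the second central moment, minus 3, so that the normal curve scores 0. The tolerance used to call a value mesokurtic is an arbitrary illustrative choice, and the data sets are hypothetical.

```python
def excess_kurtosis(values):
    """m4 / m2**2 - 3 (an assumed formula; the normal curve gives 0)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


def classify_kurtosis(k, tolerance=0.5):
    """Label a distribution; `tolerance` is a hypothetical cut-off."""
    if k > tolerance:
        return "leptokurtic"
    if k < -tolerance:
        return "platykurtic"
    return "mesokurtic"


flat = list(range(1, 101))        # evenly spread values, flat hump
peaked = [0] * 98 + [-10, 10]     # values piled at the center, long tails

for name, data in (("flat", flat), ("peaked", peaked)):
    k = excess_kurtosis(data)
    print(name, round(k, 2), classify_kurtosis(k))
```

Evenly spread data come out platykurtic, while data piled up at the center with a few extreme values come out leptokurtic.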
Sampling Distributions
Basic Concepts
In Inferential Statistics, we come up with
generalizations about the population using the
information that we collect from a sample. We
will require this sample to be a random
sample.
Sampling Distributions
The sampling distribution of a statistic is its probability distribution.
discrete: probability mass function (pmf)
continuous: probability density function (pdf)
The standard deviation of a statistic is called its standard error.
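One way to see a sampling distribution is to simulate it: draw many random samples, compute the statistic for each, and look at how the results are spread. This sketch uses the sample mean as the statistic. The population parameters, sample size, number of replications, and seed are all hypothetical choices. The comparison value sigma / sqrt(n) is the standard result for the mean of independent observations and is not stated in the slides.

```python
import math
import random
from statistics import fmean, pstdev


def simulate_sample_means(mu, sigma, n, replications, seed=0):
    """Draw `replications` random samples of size n from a normal population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(replications)
    ]


means = simulate_sample_means(mu=50, sigma=10, n=25, replications=5000)

# The standard deviation of the simulated means estimates the standard error.
empirical_se = pstdev(means)
theoretical_se = 10 / math.sqrt(25)

print("Empirical standard error:", round(empirical_se, 3))
print("Theoretical standard error:", theoretical_se)
```

The simulated means cluster around the population mean, and their standard deviation should come out close to the theoretical standard error.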
Tests of Hypotheses
Basic Concepts in Testing Statistical Hypotheses
The first step in hypothesis testing is to identify and state the statistical hypotheses to be tested.
A statistical hypothesis is a conjecture concerning one or more populations whose veracity can be established using sample data. The Null Hypothesis, denoted as Ho, is a statistical hypothesis which the researcher doubts to be true. The Alternative Hypothesis, denoted as Ha, is the operational statement of the theory that the researcher believes to be true and wishes to prove; it is the contradiction of the null hypothesis.