Lecture Stat GS

8/3/2019 Lecture Stat GS


Introduction

"STATISTICS is the most important science in the whole world; for upon it depends the practical application of every science and of every art; the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience."

- Florence Nightingale



Basic Concepts

The term statistics originated from the Latin word status, which means state.

The original definition was the science dealing with data about the condition of a state or community.

Statistics is the branch of science that deals with the collection, presentation, organization, analysis, and interpretation of data.



Basic Concepts

The population is the collection of all elements under consideration in a statistical inquiry.

The sample is a subset of a population.

The variable is a characteristic or attribute of the elements in a collection that can assume different values for the different elements.

An observation is a realized value of a variable.

Data is the collection of observations.



Basic Concepts

Example

Variable                                        Possible Observations
S = sex of a student                            Male, Female
E = employment status of an employee            Permanent, Temporary, Contractual
I = monthly income of a person in pesos         greater than or equal to zero
N = number of children of a teacher             n = 0, 1, 2, 3, ...
H = height of a basketball player (cm or in.)   h > 0



Basic Concepts

Example

The Office of Admissions is studying the relationship between the score in the entrance examination during application and the general weighted average (GWA) upon graduation among graduates of the university from 2005 to 2010.

Population: collection of all graduates of the university from the years 2005 to 2010.

Variables of interest: score in the entrance examination and GWA.



Basic Concepts

The Department of Health is interested in determining the percentage of children below 12 years old infected by the Hepatitis B virus in Laguna in 2010.

Population: set of all children below 12 years old in Laguna in 2010.

Variable of interest: whether or not the child has ever been infected by the Hepatitis B virus.



Basic Concepts

The parameter is a summary measure describing a specific characteristic of the population.

The statistic is a summary measure describing a specific characteristic of the sample.



Fields of Statistics

Two Major Fields of Statistics

1. Applied Statistics

Is concerned with the procedures and techniques used in the collection, presentation, organization, analysis, and interpretation of data.

We study applied statistics in order to learn how to select and properly implement the most appropriate statistical methods that will provide answers to our research problem.

2. Theoretical or Mathematical Statistics

Is concerned with the development of the mathematical foundations of the methods used in applied statistics.

We study mathematical statistics in order to understand the rationale behind the statistical methods we use in analysis and to establish new theories that will validate the use of new statistical methods or modifications of existing statistical methods in solving problems that are more complex.



Two Major Areas of Interest in Applied Statistics

1. Descriptive Statistics

Includes all the techniques used in organizing, summarizing, and presenting the data on hand.

The data on hand may have come from all the elements of the population so that the analysis using descriptive statistics will allow us to describe the population.

The data on hand may also come from the elements of a selected sample. In this case, the analysis using descriptive statistics will only allow us to describe the sample. The methods used in descriptive statistics will not allow us to generalize about the population using the sample data.




Two Major Areas of Interest in Applied Statistics

In descriptive statistics, we use tables and charts, and compute summary measures like averages, proportions, and percentages.

2. Inferential Statistics

We do not simply describe the sample data. Rather, we use the sample data to form conclusions about the population. Since the sample is only a subset of the population, we arrive at the conclusions about the population using inferential statistics under conditions of uncertainty.




Two Major Areas of Interest in Applied Statistics

Example

1. A badminton player wants to know his average score for the past 10 games.
   Answer: descriptive

2. Joseph wants to determine the variability of his seven exam scores in Statistics.
   Answer: descriptive




Two Major Areas of Interest in Applied Statistics

3. Based on last year's electricity bills, Mrs. Mercado would like to forecast the average monthly electricity bill she will pay for the next year based on her average monthly bill in the past year.
   Answer: inferential

4. Efren Bata wants to estimate his chance of winning in the next World Championship game in Billiards based on his average scores in the last championship and the averages of the competing players.
   Answer: inferential

5. Dr. Escape wants to determine the proportion spent on transportation during the past four months using the daily records of expenditure that she keeps.
   Answer: descriptive

6. A politician wants to determine the total number of votes his rival obtained in the sample used in the exit poll.
   Answer: descriptive



Steps in a Statistical Inquiry

Statistical inquiry is a designed research that provides information needed to solve a research problem.

1. Describe the characteristic of the elements in the population under study through the computation or estimation of a parameter such as the proportion, total, and average.

2. Compare the characteristics of the elements in the different subgroups in the population through contrasts of their respective summary measures.

3. Justify an assertion made by the researcher about a particular characteristic of the population or subgroups in the population.

4. Determine the nature and strength of relationships among the different variables of interest.

5. Identify the different groups of interrelated variables under study.



Steps in a Statistical Inquiry

6. Reveal the natural groupings of the elements in the population based on the values of a set of variables.

7. Determine the effects of one or more variables on a response variable.

8. Clarify patterns and trends in the values of a variable over time or space.

9. Predict the value of a variable based upon its relationship with another variable.

10. Forecast future values of a variable using a sequence of observations on the same variable taken over time.



Basic Steps in Performing a Statistical Inquiry

1. Identify the problem.

The researchers need to define and state the problem in a clear manner so that they can arrive at appropriate solutions and recommendations later on.

2. Plan the study.

Some statistical inquiries do not reach completion or do not succeed in arriving at useful information for sound decision making because of the researchers' failure to plan the study carefully.

3. Collect the data.

The investigators take extra measures to ensure the quality of the data collected. If the collected data are incomplete, outdated, inaccurate, or worse yet, fabricated, then it will be useless to proceed with data analysis.

There are different ways of collecting data: through surveys, observation, experiments, and use of available documented data.




Basic Steps in Performing a Statistical Inquiry

4. Explore the data.

Prior to data analysis, the investigators need to explore and understand the essential features of their data. This process allows them to determine if their data satisfy the assumptions made in the derivation of the statistical technique that they will use for analysis.

5. Analyze data and interpret the results.

After collecting and organizing data, analysis follows. The investigators once more carry out the plans specified in the research design but this time on data analysis. They then examine all the results on tables, charts, estimated summary measures, and tests of hypotheses. They need to check that they were able to meet all of the specific objectives. Based on the analysis carried out, the investigators must be able to answer the research problem and give recommendations on how this can be useful in decision making.

The investigators must double-check the results that contradict existing theory or the earlier hypothesis made. They may have committed errors in data collection or analysis. If not, they would have to propose possible explanations for these results or suggest future statistical inquiries that could help explain the inconsistency.




Basic Steps in Performing a Statistical Inquiry

6. Present the results.

After analyzing the data and interpreting the results, the investigators must present these results in a clear and concise manner to the users of the research.



Measurement

Process of determining the value or label of the variable based on what has been observed.

Ratio level of measurement has all of the following properties:

A) the numbers in the system are used to classify a person/object into distinct, nonoverlapping, and exhaustive categories;

B) the system arranges the categories according to magnitude;

C) the system has a fixed unit of measurement representing a set size throughout the scale; and

D) the system has an absolute zero.

Examples:

1. allowance of a student (in pesos)

2. distance travelled by an airplane (in km)

3. the speed of a car (in km/h)

4. height of an adult (in cm)

5. weight of a newborn baby (in kg)



Interval level of measurement satisfies only the first three properties of the ratio level.

The only difference between the interval and the ratio levels is the interpretation of the value 0 in their scales. The zero point in the interval level is not an absolute zero. Unlike in the ratio scale, the zero value in the interval scale has an arbitrary interpretation and does not mean the absence of the property we are measuring.

Examples: temperature readings measured in degrees Centigrade (°C), Intelligence Quotient (IQ)

If the temperature is 0 °C, do we say that there is no temperature? Of course not, since 0 °C is not an absolute zero.



Ordinal level of measurement satisfies only the first two properties of the ratio level.

Examples:

1. Size of shirt (small, medium, large, extra large)

2. Performance rating of a salesperson measured as follows: excellent, very good, good, satisfactory, poor

3. Faculty rank: Professor, Associate Professor, Assistant Professor, Instructor



Nominal level of measurement satisfies only the first property of the ratio level.

Examples:

1. Gender

2. Civil status

3. Type of movie (Action, Romance, Comedy, Others)

4. Major island group (Luzon, Visayas, Mindanao)



Data Collection Methods

1. Use of available documented data in published or unpublished studies.

2. Surveys

3. Experiments

4. Observations



Collection of Data

Primary Data

are data documented by the primary source. The data collectors themselves documented this data.

Secondary Data

are data documented by a secondary source. An individual/agency, other than the data collectors, documented this data.



Collection of Data

Survey

is a method of collecting data on the variable of interest by asking people questions. When the data come from asking all the people in the population, the study is called a census. On the other hand, when the data come from asking a sample of people selected from a well-defined population, the study is called a sample survey.



Collection of Data

Experiment

is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.



Collection of Data

Observation method

is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.



Collection of Data

Examples:

1) A local TV network asked voters to indicate whom they voted for as they exited the polling booth.
   Answer: Survey

2) A private hospital divides terminally ill patients into two groups, with one group receiving medication A and the other group receiving medication B. After a month, they measured each subject's improvement.
   Answer: Experiment

3) A researcher investigates the level of pollution in key points in Metro Manila by setting up pollution measuring devices at selected intersections.
   Answer: Observation



Collection of Data

Questionnaire

is a measurement instrument used in various data collection methods, particularly surveys. We use a questionnaire to determine and record the measurements of characteristics of the elements in a study such as height, weight, color, size, attitude, past and present behavior, and opinions.

Two Types of Questionnaire Used in Surveys

1. Self-Administered Questionnaire

2. Interview Schedule



Collection of Data

Types of Questions

1. Closed-ended question

is a type of question that includes a list of response categories from which the respondent will select his answer.

2. Open-ended question

is a type of question that does not include response categories.



Collection of Data

Advantages of open-ended questions:

  Respondent can freely answer
  Can elicit feelings and emotions of the respondent
  Can reveal new ideas and views that the researcher might not have considered
  Good for complex issues
  Good for questions whose possible responses are unknown
  Allows respondent to clarify answers
  Gets detailed answers
  Shows how respondents think

Advantages of closed-ended questions:

  Facilitates tabulation of responses
  Easy to code and analyze
  Saves time and money
  High response rate since it is simple and quick to answer
  Response categories make questions easy to understand
  Can repeat the study and easily make comparisons



Collection of Data

Disadvantages of open-ended questions:

  Difficult to tabulate and code
  High refusal rate because it requires more time and effort on the part of the respondent
  Respondent needs to be articulate
  Responses can be inappropriate or vague
  May threaten respondent
  Responses have different levels of detail

Disadvantages of closed-ended questions:

  Increases respondent burden when there are too many or too limited response categories
  Biases responses against categories excluded from the list of choices
  Difficult to detect if respondent misinterpreted the question



Collection of Data

Pitfalls To Avoid in Wording Questions

1. Avoid Vague Questions

2. Avoid Biased Questions

3. Avoid Confidential and Sensitive Questions

4. Avoid Questions that are Difficult to Answer

5. Avoid Questions that are Confusing or Perplexing to Answer

6. Keep the question short and simple



Sampling and Sampling Techniques

Advantages of Sampling

1. Sampling is more economical.

2. A study based on a sample requires less time to accomplish.

3. Sampling allows for a wider scope for the study.

4. Results of studies based on a sample can even be more accurate.

5. Sampling is sometimes the only feasible method.




Target population is the population we want to study.

Sampled population is the population from where we actually select the sample.

Elementary unit or element is a member of the population whose measurement on the variable of interest is what we wish to examine.

Sampling unit is a unit of the population that we select in our sample.




Target population: set of all establishments in the manufacturing, mining, and agriculture industries.

Elementary unit: an establishment (which is an economic unit that engages, under a single ownership or control, in one or predominantly one kind of economic activity at a fixed single physical location) in the manufacturing, mining, and agriculture industries.

Sampling unit: an enterprise (which is an economic unit consisting of one or more establishments under a single ownership or control) in the manufacturing, mining, and agriculture industries.




Sampling Frame or Frame

is a list or map showing all the sampling units in the population.

Example

Suppose a researcher is interested in getting the opinion of eligible voters on the media campaign of candidates running for top positions in the government.

Target population: set of all eligible voters

Sampling frame: Commission on Elections (COMELEC) list of registered voters

Sampled population: set of registered voters in the list of COMELEC

This sampled population excludes the eligible voters who did not register or did not revalidate their registration with the COMELEC during the registration period.




Sampling Error

is the error attributed to the variation present among the computed values of the statistic from the different possible samples consisting of n elements.

Nonsampling Error

is the error from the other sources apart from sampling fluctuations.




Sampling error occurs when we collect data from a sample and not from all the elements in the population. It is an error inherent in results based on a sample.
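To make the idea of sampling error concrete, here is a minimal simulation sketch. The population of scores, the sample size of 25, and the seed are all made up for illustration; the point is only that different samples of the same size give different values of the statistic, none of which need equal the parameter.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 1,000 exam scores between 40 and 99
population = [rng.randint(40, 99) for _ in range(1000)]
parameter = mean(population)  # population mean (the parameter)

# Draw five different samples of n = 25 and compute the sample mean of each
estimates = [mean(rng.sample(population, 25)) for _ in range(5)]

for est in estimates:
    # The difference between the statistic and the parameter is due to sampling
    print(f"sample mean = {est:.2f}, difference from parameter = {est - parameter:+.2f}")
```

Each printed difference comes only from which elements happened to fall in the sample, since no measurement or implementation mistakes are simulated here.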

Classifications of Nonsampling Error

1. Measurement error

2. Error in the implementation of the sampling design




Measurement error

is the difference between the true value of the variable and the observed value used in the study. This occurs when we are using a faulty measurement instrument or when we do not use the instrument properly.

Error in the implementation of the sampling design

occurs when we do not adhere to the procedures and requirements as specified in the sampling design.




Total Error
  Sampling Error
  Nonsampling Error
    Error in the Implementation of the Sampling Design
    Measurement Error

Error types shown in the diagram: Instrument Error, Selection Error, Response Error, Response Bias, Nonresponse Bias, Frame Error, Processing Error, Population Specification Error, Interviewer Bias, Surrogate Information Error.




Measurement Errors

1. Interviewer bias may occur when an enumerator reacts to a respondent's reply.

2. Errors in editing and coding.

3. Bias occurs when a respondent tends to respond to items in an acceptable manner instead of truthfully.

4. Errors in conversion from one unit of measurement to another.

5. Response set occurs when a respondent agrees with all the statements without careful consideration of each one of the given statements.

6. Faulty measurement devices such as a weighing scale that is not properly calibrated.




Errors in the Implementation of the Sampling Design

1. Sampling frame defines a sampled population that is too far from the target population.

2. Sampling frame is outdated.

3. Complicated sample selection procedure is done in the field by confused enumerators who incorrectly select the respondents included in the sample.

4. Lazy enumerators do not follow the specified sample selection procedure.

5. Target population is the target consumers of a particular brand but researchers incorrectly define the qualifications of the target consumers.




Probability Sampling

is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling.




Probability Sampling Methods

1. Simple Random Sampling

2. Stratified Sampling

3. Systematic Sampling

4. Cluster Sampling

5. Multistage Sampling




Simple random sampling is a probability sampling method wherein all possible subsets consisting of n elements selected from the N elements of the population have the same chance of selection.

In simple random sampling without replacement (SRSWOR), all the n elements in the sample must be distinct from each other.

In simple random sampling with replacement (SRSWR), the n elements in the sample need not be distinct; that is, an element can be selected more than once to be a part of the sample.
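The two variants can be sketched in a few lines of Python. This is only an illustration: the population of 20 numbered elements, the function names, and the seeds are assumptions, and the standard library's pseudo-random generator stands in for the randomization mechanism.

```python
import random

def srswor(population, n, seed=None):
    """Simple random sampling without replacement: n distinct elements."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def srswr(population, n, seed=None):
    """Simple random sampling with replacement: an element may repeat."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

# Hypothetical population: elements numbered 1 to N = 20
population = list(range(1, 21))

print(srswor(population, 5, seed=1))  # five distinct element numbers
print(srswr(population, 5, seed=1))   # five element numbers, repeats allowed
```

With SRSWOR the sample size cannot exceed N, while SRSWR places no such limit.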




Systematic sampling is a probability sampling method wherein the selection of the first element is at random and the selection of the other elements in the sample is systematic, by subsequently taking every kth element from the random start, where k is the sampling interval.
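A minimal sketch of this selection follows, assuming the random start is chosen among the first k elements of a numbered list (one common way to do it; the definition above does not fix this detail). The population and the interval k = 10 are hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Random start among the first k elements, then every kth element after it."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # index 0 .. k-1 of the random start
    return population[start::k]   # the random start and every kth element after it

# Hypothetical population: elements numbered 1 to N = 100
population = list(range(1, 101))
sample = systematic_sample(population, k=10, seed=7)
print(sample)
```

Because N = 100 is a multiple of k = 10 here, every possible random start yields exactly 10 elements.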




Cluster sampling is a probability sampling method wherein we divide the population into nonoverlapping groups or clusters consisting of one or more elements, and then select a sample of clusters. The sample will consist of all the elements in the selected clusters.
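As a rough sketch, cluster sampling can be written as follows. The sections and student names are invented for illustration, and the clusters themselves are chosen by simple random sampling without replacement.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m clusters at random; keep every element of each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # SRS of cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical population: students grouped into class sections (the clusters)
clusters = {
    "Section A": ["Ana", "Ben", "Carl"],
    "Section B": ["Dina", "Ed"],
    "Section C": ["Fe", "Gio", "Hana", "Ian"],
    "Section D": ["Joy", "Ken"],
}

sample = cluster_sample(clusters, m=2, seed=3)
for name, members in sample.items():
    print(name, members)
```

Note that only a list of clusters is needed to draw the sample; the elements are looked up after the clusters are selected.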




Multistage sampling is a probability sampling method where there is a hierarchical configuration of sampling units and we select a sample of these units in stages.



Basic Methods of Probability Sampling

1. Simple Random Sampling

Procedure: List the elements and number them from 1 to N. Select n numbers from 1 to N, using a randomization mechanism. The sample will consist of the elements corresponding to the numbers selected.

Advantages: Design is simple and easy to understand. Estimation methods are simple and easy.

Disadvantages: It needs a list of all elements in the population. Sample size must be very large for heterogeneous populations in order to get reliable results. High transportation cost if elements are widely spread geographically.

When to use: If the elements are homogeneous with respect to the characteristic under study. If the elements are not so spread out geographically.




2. Stratified Random Sampling

Procedure: Divide the population into nonoverlapping strata. Obtain a simple random sample from each stratum. The sample consists of the selected samples in all the strata.

Advantages: Estimates are more reliable compared to SRS of the same sample size if the population has been divided into strata with homogeneous elements, but the strata are very different from each other. Estimation of the parameter for each subpopulation is easier when compared to other sampling methods. It can facilitate the administration and supervision of data collection, especially if the stratification variable is geographic subdivision.

Disadvantages: It needs a list of all elements of the population, including their values of the stratification variable. High transportation cost if elements are widely spread geographically, unless there are field offices in each geographic area.

When to use: If the population is heterogeneous with respect to the characteristic under study. If we want to perform separate analysis for certain subpopulations. If we wish to facilitate the administration of the collection of data.




3. Systematic Sampling

Procedure: Assign a unique number from 1 to N to each element of the population. Determine the sampling interval, k. Obtain the first element in the sample using a randomization mechanism. Get the rest of the elements in the sample by taking every kth element from the random start.

Advantages: Identifying the units in the sample is easy. The design does not require a list of all elements in the population. The sample is distributed evenly over the entire population. It gives more reliable estimates than simple random sampling when the arrangement of the elements in the sampling frame is according to magnitude.

Disadvantages: Estimates may not be reliable when there are periodic regularities in the list. It requires information on the arrangement of the elements in the sampling frame to determine the reliability of the estimates.

When to use: If there is no available list of elements in the population. If the arrangement of the elements in the sampling frame is according to magnitude.




4. Cluster Sampling

Procedure: Divide the population into nonoverlapping clusters. Select a sample of clusters using simple random sampling. The sample consists of all the elements in the selected clusters.

Advantages: The design needs only a list of clusters and not a list of elements. Transportation and listing costs are usually lower.

Disadvantages: Estimates are usually less reliable when compared to other sampling designs. It is not cost-efficient if the clusters are large and the elements are homogeneous with respect to the characteristic under study.

When to use: If there is no available list of elements. If cost is more important than reliability of the estimates.

5. Multistage Sampling

Procedure: Select the sample in several stages.

Advantages: Reduced transportation and listing cost.

Disadvantages: Difficult estimation procedures. The design needs thorough planning before performing sample selection.

When to use: If the geographic coverage of the population of interest is wide. If no listing of the elementary units in the population is available.
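Of the methods summarized above, stratified random sampling is the one whose procedure is given only in the table, so a short sketch may help. The strata (year levels), the student labels, and the per-stratum sample sizes below are all hypothetical; how to allocate the sample among the strata is a separate design decision not covered here.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Take a simple random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(strata[name], sizes[name]) for name in strata}

# Hypothetical strata: students grouped by year level
strata = {
    "Freshman":  [f"F{i}" for i in range(1, 41)],   # 40 students
    "Sophomore": [f"S{i}" for i in range(1, 31)],   # 30 students
    "Junior":    [f"J{i}" for i in range(1, 21)],   # 20 students
}
# Assumed sample sizes per stratum
sizes = {"Freshman": 4, "Sophomore": 3, "Junior": 2}

sample = stratified_sample(strata, sizes, seed=5)
for name, members in sample.items():
    print(name, members)
```

Every stratum is represented in the sample, which is what allows separate analysis for each subpopulation.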



Methods of Nonprobability Sampling

Nonprobability sampling methods do not make use of any randomization mechanism in identifying the sampling units included in the sample. Rather, they allow the researcher to choose the units in the sample subjectively. Since the selection of the sample is subjective, there is consequently no objective way of assessing the reliability of the results without making assumptions that are oftentimes difficult to verify.




Haphazard or Convenience Sampling

The sample consists of elements that are most accessible or easiest to contact. This usually includes friends, acquaintances, volunteers, and subjects who are available and willing to participate at the time of the study.

Example: The adviser of a student organization is conducting research on the study habits of students in the university. To select a sample, the adviser includes the members of the student organization because it is easy to reach them and get data from them.




Judgement or Purposive Sampling

The researcher chooses a sample that agrees with his/her subjective judgement of a representative sample.




Quota Sampling

is the nonprobability sampling version of stratified sampling. In quota sampling, the researcher also chooses the groupings or strata in the study but the selection of the sampling units within the stratum does not make use of a probability sampling method. The researcher just sets a quota or number of sampling units to be included in each grouping but uses convenience sampling to select the units within each grouping.




Sample Size Determination

An important component of the sampling design is the sample size. The number of elements that you include in the sample must not be too small because this will not allow you to come up with reliable estimates.

In determining the sample size, you should always consider the reliability of the results of the study and, at the same time, the cost involved in doing the study.



Presentation of Data

Textual Presentation

Textual presentation of data incorporates important figures in a paragraph of text.

Tabular Presentation

Tabular presentation of data arranges figures in a systematic manner in rows and columns.

Graphical Presentation

Graphical presentation of data portrays numerical figures or relationships among variables in pictorial form.



Organization of Data

Raw Data

are data in their original form.

Array

is an ordered arrangement of data according to magnitude. We also refer to the array as sorted data or ordered data.



Organization of Data

Frequency Distribution

Frequency distribution is a way of summarizing data by showing the number of observations that belong in the different categories or classes. We also refer to this as grouped data.




Final Grade    No. of Students
40-49            8
50-59           23
60-69           42
70-79           62
80-89           58
90-99           17
Total          210

Final Grade    No. of Students
40-46            7
47-53            9
54-60           18
61-67           30
68-74           41
75-81           48
82-88           39
89-95           13
96-102           5
Total          210

Final Grade    No. of Students
40-45            7
46-51            6
52-57           10
58-63           24
64-69           26
70-75           35
76-81           45
82-87           34
88-93           13
94-99           10
Total          210



Frequency Distribution

Class Interval is the range of values that belong in the class or category.

Class Frequency is the number of observations that belong in a class interval.

Class Limits are the end numbers used to define the class interval. The lower class limit (LCL) is the lower end number while the upper class limit (UCL) is the upper end number.

Class Boundaries are the true limits. If the observations are rounded figures, then we identify the class boundaries based on the standard rules of rounding as follows: the lower class boundary (LCB) is halfway between the lower class limit of the class and the upper class limit of the preceding class, while the upper class boundary (UCB) is halfway between the upper class limit of the class and the lower class limit of the next class.




Class Size is the size of the class interval. It is the difference between the upper class boundaries of the class and the preceding class; or the difference between the lower class boundaries of the next class and the class. We can also use the class limits in place of the class boundaries.

Class Mark is the midpoint of a class interval. It is the average of the lower class limit and the upper class limit.




Final Grade    No. of Students
40-49            8
50-59           23
60-69           42
70-79           62
80-89           58
90-99           17
Total          210

For the first class interval, 40-49:

Class limits: 40 is the lower class limit while 49 is the upper class limit

Class boundaries: 39.5-49.5 (lower and upper class boundaries)

Class frequency: 8

Class size: 50 - 40 = 10

Class mark or midpoint: (40 + 49)/2 = 44.5
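The same quantities can be computed for any class interval. The helper below is a sketch under one assumption: the gap between the upper limit of a class and the lower limit of the next class is the same throughout the table (1 in this example). The function name is made up for illustration.

```python
def describe_class(lcl, ucl, next_lcl):
    """Class boundaries, size, and mark from the class limits.

    lcl, ucl  -- lower and upper class limits of the class
    next_lcl  -- lower class limit of the next class
    Assumes the gap between consecutive classes is constant.
    """
    gap = next_lcl - ucl                 # e.g. 50 - 49 = 1
    lcb = lcl - gap / 2                  # lower class boundary
    ucb = ucl + gap / 2                  # upper class boundary
    size = next_lcl - lcl                # class size from consecutive lower limits
    mark = (lcl + ucl) / 2               # class mark (midpoint)
    return {"boundaries": (lcb, ucb), "size": size, "mark": mark}

print(describe_class(40, 49, 50))
# {'boundaries': (39.5, 49.5), 'size': 10, 'mark': 44.5}
```

The output agrees with the values worked out by hand for the class 40-49.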



Frequency Distribution

Steps in the Construction of a Frequency Distribution

1. Determine the adequate number of classes, K. Usually K is between 5 and 20, or K = 1 + 3.322 log n (Sturges' rule), where log is the ordinary logarithm (base 10) and n is the number of observations.

2. Determine the range, R = highest observed value - lowest observed value.

3. Compute C' = R/K.

4. Determine the class size, C, by rounding off C' to a convenient number.

5. Choose the lower class limit of the first class. This is usually based on the lowest observed value.

6. Tally all the observed values in each class interval.

7. Sum the frequency column and check against the total number of observations.
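The steps above can be sketched in code. This is one possible reading, not a definitive procedure: it assumes whole-number data, rounds R/K up to the next whole number to get the class size, and starts the first class at the lowest observed value. The scores are made up.

```python
import math

def frequency_distribution(data):
    """Build (LCL, UCL, frequency) rows for whole-number data."""
    n = len(data)
    k = 1 + 3.322 * math.log10(n)        # step 1: Sturges' rule
    r = max(data) - min(data)            # step 2: range
    c = max(1, math.ceil(r / k))         # steps 3-4: R/K rounded up (an assumption)
    lcl = min(data)                      # step 5: first lower class limit
    table = []
    while lcl <= max(data):
        ucl = lcl + c - 1                # upper class limit for whole-number data
        freq = sum(lcl <= x <= ucl for x in data)   # step 6: tally
        table.append((lcl, ucl, freq))
        lcl += c
    # step 7: the frequencies must add up to the number of observations
    assert sum(f for _, _, f in table) == n
    return table

# Hypothetical exam scores
scores = [41, 45, 52, 58, 63, 67, 67, 70, 72, 75, 78, 81, 84, 88, 90, 95]
for lcl, ucl, freq in frequency_distribution(scores):
    print(f"{lcl}-{ucl}: {freq}")
```

In practice the class size is often rounded to a convenient number such as 5 or 10 rather than simply rounded up, so the resulting table may differ from one built by hand.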




Cumulative Frequency Distribution (CFD)

The cumulative frequency distribution is another variation of the frequency distribution. We use this to determine how many observations have values smaller than or greater than a specified class boundary. It shows the accumulated frequencies of successive classes, either at the beginning or at the end of the distribution.

The less than CFD shows the number of observations with values smaller than or equal to the upper class boundary.

The greater than CFD shows the number of observations with values larger than or equal to the lower class boundary.




Final Grade    No. of Students    Less than CFD    Greater than CFD
40-49            8                   8               210
50-59           23                  31               202
60-69           42                  73               179
70-79           62                 135               137
80-89           58                 193                75
90-99           17                 210                17
Total          210
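Both cumulative columns in the table above are running totals of the class frequencies, one accumulated from the first class and the other from the last. A short sketch, using the frequencies from the table:

```python
from itertools import accumulate

classes = ["40-49", "50-59", "60-69", "70-79", "80-89", "90-99"]
freqs = [8, 23, 42, 62, 58, 17]

# Less than CFD: accumulate from the first class downward
less_than = list(accumulate(freqs))

# Greater than CFD: accumulate from the last class upward
greater_than = list(accumulate(reversed(freqs)))[::-1]

for row in zip(classes, freqs, less_than, greater_than):
    print(*row)
```

For example, the row for 70-79 reads 135 and 137: 135 students scored below 79.5, and 137 students scored 69.5 or higher.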



Graphical Presentation of the Frequency Distribution

1. Frequency Histogram

2. Frequency Polygon

3. Less Than Ogive

4. Greater Than Ogive

5. Pie Chart



Measures of Central Tendency

Measures of central tendency, or "location", attempt to quantify what we mean when we think of the "typical" or "average" score in a data set. The concept is extremely important and we encounter it frequently in daily life.

For example, before purchasing a car we often want to know its average distance per litre of petrol. Or before accepting a job, you might want to know what a typical salary is for people in that position so you will know whether or not you are going to be paid what you are worth. Or, if you are a smoker, you might often think about how many cigarettes you smoke "on average" per day.

Statistics geared toward measuring central tendency all focus on this concept of "typical" or "average." As we will see, we often ask questions in psychological science revolving around how groups differ from each other "on average". Answers to such a question tell us a lot about the phenomenon or process we are studying.

We also use measures of central tendency to facilitate the comparison of two or more data sets. For example, a teacher may want to answer the question, "Who performed better in the exam, the girls or the boys?" The teacher can then compare the average score of the girls with the average score of the boys. If the average score of the girls is higher, then the teacher can conclude that the girls generally performed better than the boys in the exam.


74/97

Measures of Central Tendency

The Arithmetic Mean

The arithmetic mean, or simply the mean, is the most common type of average. It is the sum of all the observed values divided by the number of observations.

μ = population mean

x̄ = sample mean
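
As a minimal sketch of this definition, the function below divides the sum of the observed values by the number of observations. The function name and the exam scores are hypothetical, made up only for this illustration.

```python
def arithmetic_mean(values):
    """Sum of all observed values divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty collection is undefined")
    return sum(values) / len(values)


# Hypothetical exam scores, made up for this illustration.
girls = [82, 90, 77, 85, 91]
boys = [78, 88, 74, 80, 85]

print("girls:", arithmetic_mean(girls))
print("boys:", arithmetic_mean(boys))
```

With these made-up scores, the girls have the higher mean, so a teacher comparing the two groups would say the girls generally performed better.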


75/97


The Weighted Mean

When all the individual observed values have equal importance, we compute the arithmetic mean. On the other hand, if we believe that the individual observed values vary in their degree of importance, then it is advisable to use a modification of the mean that we call the weighted mean. The weighted mean assigns weights to the observations depending on their relative importance.
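
The slide does not give a formula, so the sketch below assumes the usual one: the sum of each value times its weight, divided by the sum of the weights. The grades and credit units are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical course grades weighted by credit units (made-up data).
grades = [1.5, 2.0, 1.25]
units = [3, 5, 2]

print(weighted_mean(grades, units))
```

When all the weights are equal, this reduces to the ordinary arithmetic mean.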


76/97


The Trimmed Mean

We have noted that the mean may not be a good measure of central tendency whenever there are outliers. An outlier pulls the value of the mean in its direction and farther away from the other observations. However, a modification of the arithmetic mean, called the trimmed mean, addresses this particular problem.
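
The slide does not say how the trimming is done. The sketch below assumes the common approach: sort the data, drop a fixed proportion of observations from each end, and average what remains. The function name, the 10% proportion, and the data are all assumptions for illustration.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Made-up data in which 98 is an outlier.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]

ordinary = sum(data) / len(data)   # pulled upward by the outlier
trimmed = trimmed_mean(data, 0.10)  # drops 2 and 98 before averaging

print(ordinary, trimmed)
```

Here the ordinary mean is larger than every observation except the outlier, while the trimmed mean stays near the bulk of the data.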


77/97


The Median

The median divides an ordered set of observations into two equal parts. In other words, it is the measure occupying the positional center of the array. If an observation is smaller than the median, then it belongs in the lower half of the array; and if an observation is larger than the median, then it belongs in the upper half of the array.
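
A minimal sketch of finding the positional center is shown below. For an even number of observations it assumes the usual convention of averaging the two middle values, which the slide does not state. The data are made up.

```python
def median(values):
    """Middle value of the ordered array (average of the two middle
    values when the number of observations is even)."""
    if not values:
        raise ValueError("the median of an empty collection is undefined")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median([7, 3, 9, 1, 5])   # ordered array: 1, 3, 5, 7, 9
even_case = median([7, 3, 9, 1])     # ordered array: 1, 3, 7, 9

print(odd_case, even_case)
```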


78/97


The Mode

The mode is the most frequently observed value in the data set. It is the observed value that occurs the greatest number of times. If the data set is small, we can easily find the mode by inspection. However, as the data set becomes large, finding the mode is quite tedious without a computer. Generally, the mode is a less popular measure of central tendency than the mean and the median.
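
Counting by computer is straightforward. The sketch below returns every value tied for the highest frequency. It also assumes one convention for "no mode": an empty list when several distinct values all occur equally often. The function name and data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values that occur the greatest number of times.

    Assumed convention: if there are several distinct values and they all
    occur equally often, the data set is treated as having no mode.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]


print(modes([1, 2, 2, 3]))          # one mode
print(modes([1, 1, 2, 2, 3]))       # two modes
print(modes([1, 2, 3]))             # no mode under the assumed convention
print(modes(["Male", "Female", "Female"]))  # works for nominal data too
```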



79/97

Summary of the Different Measures of Central Tendency

Mean (center of mass)
Definition: Sum of all the values in the collection divided by the total number of elements in the collection
Data requirement: At least interval scale and values that are close to each other
Existence/Uniqueness: Always exists/Always unique
Takes into account every value? Yes
Affected by outliers? Yes
Can treat formula algebraically? Yes

Median (center of the array)
Definition: Divides the array into two equal parts
Data requirement: At least ordinal scale
Existence/Uniqueness: Always exists/Always unique
Takes into account every value? No
Affected by outliers? No
Can treat formula algebraically? No

Mode (typical value)
Definition: Most frequent value
Data requirement: Even if nominal scale only
Existence/Uniqueness: Might not exist/Not always unique
Takes into account every value? No
Affected by outliers? No
Can treat formula algebraically? No


80/97

Measures of Location

A measure of location provides us information on the percentage of observations in the collection whose values are less than or equal to it. We also commonly refer to these measures of location as quantiles or fractiles.

The three measures of location are percentiles, quartiles, and deciles.


81/97

Measures of Location

Percentiles divide the ordered observations into 100 equal parts.

Quartiles divide the ordered observations into 4 equal parts.

Deciles divide the ordered observations into 10 equal parts.
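
One way to compute these cut points is Python's standard `statistics.quantiles` function, sketched below with made-up scores. Note that textbooks and software use several slightly different interpolation rules for quantiles; this example simply uses the function's default method, which may not match the rule used in this lecture.

```python
import statistics

# Hypothetical scores, made up for this illustration.
scores = [55, 61, 64, 68, 70, 72, 75, 78, 81, 83, 86, 90, 94]

quartiles = statistics.quantiles(scores, n=4)      # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(scores, n=10)       # 9 cut points: D1 ... D9
percentiles = statistics.quantiles(scores, n=100)  # 99 cut points: P1 ... P99

# The 2nd quartile, 5th decile and 50th percentile all mark the middle.
print(quartiles[1], deciles[4], percentiles[49], statistics.median(scores))
```

The 2nd quartile, 5th decile, and 50th percentile coincide with the median, since each leaves half of the ordered observations below it.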


82/97

Measures of Dispersion

A Measure of Dispersion is a descriptive summary measure that helps us characterize the data set in terms of how varied the observations are from each other. This measure allows us to determine the degree of dispersion of the observations about the center of the distribution. If its value is small, then this indicates that the observations are not too different from each other, so that there is a concentration of observations about the center. On the other hand, if its value is large, then this indicates that the observations are very different from each other, so that they are widely spread out from the center.


83/97


84/97


The Variance and Standard Deviation

The variance is a measure of dispersion, so we can use it to describe the variation of the measurements in the collection. It is defined as the average squared deviation or difference of each observation from the mean. The squared difference of an observation from the mean gives us an idea of how close this observation is to the mean. A large squared difference indicates that the observation and the mean are far from each other, while a small squared difference indicates that the observation and the mean are close to each other. In fact, when the squared difference is zero (0), this implies that the observation and the mean are equal to each other.
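
The definition can be sketched directly in code. The function below averages the squared deviations over all N observations, which matches the "average squared deviation" wording and is the population form (the sample variance commonly divides by n - 1 instead). The standard deviation is taken as the square root of the variance. Function names and data are made up for illustration.

```python
from math import sqrt


def population_variance(values):
    """Average squared deviation of each observation from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("the variance of an empty collection is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the (population) variance."""
    return sqrt(population_variance(values))


# Made-up measurements whose mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data), standard_deviation(data))
```

The two observations equal to 5 contribute a squared difference of zero, since they coincide with the mean.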


85/97


We can also use the variance to determine if the mean is a good measure of central tendency. A small variance indicates that the observations are highly concentrated about the mean, so that it is appropriate to use the mean to represent all of the values in the collection. If the variance is large, on the other hand, this indicates that, on average, the observations are far or very different from the mean. In this case, we cannot consider the mean a good measure of central tendency because it will not be a suitable representative of all values in the collection.


86/97


The Coefficient of Variation

The coefficient of variation is a measure of relative variation. We can use it to compare the variability of two or more data sets even if they have different means or different units of measurement, because the coefficient of variation has no unit.

The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage.
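
A minimal sketch of the CV follows. It uses the population standard deviation from Python's `statistics` module, which is an assumption, since the slide does not say whether the population or sample form is intended. The heights and weights are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean * 100


# Made-up data in different units: heights in cm, weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

print(f"heights: {coefficient_of_variation(heights_cm):.1f}%")
print(f"weights: {coefficient_of_variation(weights_kg):.1f}%")
```

Both data sets have the same standard deviation, but the weights vary more relative to their mean, so their CV is larger.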


87/97

Measures of Shape: Skewness and Kurtosis

A measure of skewness is a single value that indicates the degree and direction of asymmetry.

If it is possible to divide the histogram at the center into two identical halves, wherein each half is a mirror image of the other, then the distribution is called a symmetric distribution. Otherwise, it is called a skewed distribution.

88/97


Measures of Shape: Skewness and Kurtosis

If the concentration of the values is at the left end of the distribution and the upper tail of the distribution stretches out more than the lower tail, then the distribution is said to be positively skewed or skewed to the right. Conversely, if the concentration of the values is at the right end of the distribution and the lower tail of the distribution stretches out more than the upper tail, then the distribution is said to be negatively skewed or skewed to the left.



89/97


Measures of Shape: Skewness and Kurtosis

Skewed to the right



90/97


Measures of Shape: Skewness and Kurtosis

Skewed to the left



91/97


Measures of Shape: Skewness and Kurtosis

Interpreting skewness: If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer.

If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979), a classic, suggests this rule of thumb:

If skewness is less than -1 or greater than +1, the distribution is highly skewed.

If skewness is between -1 and -1/2 or between +1/2 and +1, the distribution is moderately skewed.

If skewness is between -1/2 and +1/2, the distribution is approximately symmetric.

Mean = Median = Mode ---- Symmetric Distribution

Mean > Median > Mode ---- Positively Skewed

Mean < Median < Mode ---- Negatively Skewed
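
The rule of thumb can be applied mechanically once a skewness value is available. The slides do not show a skewness formula, so the sketch below assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment). The function names, the treatment of the exact boundary values, and the data are assumptions for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


def classify_skewness(g1):
    """Apply the rule of thumb quoted above.

    Assumption: values exactly equal to +/-1 or +/-1/2 fall in the milder class.
    """
    if g1 < -1 or g1 > 1:
        return "highly skewed"
    if g1 < -0.5 or g1 > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


symmetric = [1, 2, 3, 4, 5]            # mirror image about 3
right_tailed = [1, 1, 2, 2, 3, 10]     # long upper tail
left_tailed = [-x for x in right_tailed]

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    g1 = skewness(data)
    print(f"{name}: skewness = {g1:+.2f}, {classify_skewness(g1)}")
```

The sign of the result gives the direction of the skew (positive for a longer right tail), and the magnitude is what the rule of thumb grades.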



92/97


Measures of Shape: Skewness and Kurtosis

Karl Pearson (1857-1936), the founder of biometrics and a major contributor to the theory of modern applied statistics, coined the term kurtosis in 1905. The term came from the Greek word kurtos, meaning convex. Pearson used it to describe the shape of the hump of a relative frequency distribution as compared to the normal distribution.



93/97


Measures of Shape: Skewness and Kurtosis

Mesokurtic: the hump is the same as that of the normal curve. It is neither too flat nor too peaked. (Section C)

Leptokurtic: the curve is more peaked and the hump is narrower or sharper than the normal curve. The prefix lepto came from the Greek leptos, meaning small or thin. (Section E)

Platykurtic: the curve is less peaked and the hump is flatter than the normal curve. The prefix platy came from the Greek word platus, meaning wide or flat. (Section D)
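
The three labels can be attached to a data set numerically. The slides give no kurtosis formula, so the sketch below assumes the moment coefficient of kurtosis (fourth central moment divided by the square of the second), for which the normal distribution has the value 3. The function names, the tolerance around 3 used for "mesokurtic", and the data are all assumptions made for illustration.

```python
def kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2**2 (an assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


def classify_kurtosis(k, tolerance=0.1):
    """Compare with the normal curve's value of 3.

    `tolerance` is a hypothetical cutoff chosen only for this sketch.
    """
    excess = k - 3
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]         # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]     # sharp hump at zero

for name, data in [("flat", flat), ("peaked", peaked)]:
    k = kurtosis(data)
    print(f"{name}: kurtosis = {k:.2f}, {classify_kurtosis(k)}")
```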



94/97


Measures of Shape: Skewness and Kurtosis


95/97

Sampling Distributions

Basic Concepts

In Inferential Statistics, we come up with generalizations about the population using the information that we collect from a sample. We will require this sample to be a random sample.


96/97

Sampling Distributions

The sampling distribution of a statistic is its probability distribution.

Discrete statistic: probability mass function (pmf)

Continuous statistic: probability density function (pdf)

The standard deviation of a statistic is called its standard error.
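
For a very small population, the sampling distribution of a statistic can be listed exactly. The sketch below does this for the sample mean, assuming random samples of size 2 drawn with replacement so that every ordered sample is equally likely. The population and the sample size are made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import sqrt

population = [1, 2, 3, 4]  # hypothetical population
n = 2                      # sample size

# Every ordered sample of size n (with replacement) is equally likely.
samples = list(product(population, repeat=n))

# pmf of the sample mean: probability of each possible value of the statistic.
counts = Counter(Fraction(sum(s), n) for s in samples)
pmf = {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

# Mean and variance of the sampling distribution.
expected = sum(m * p for m, p in pmf.items())
variance = sum((m - expected) ** 2 * p for m, p in pmf.items())

# The standard error is the standard deviation of the statistic.
standard_error = sqrt(variance)

for m, p in pmf.items():
    print(f"sample mean = {float(m):.1f}  probability = {p}")
print("standard error of the sample mean:", standard_error)
```

Because the sample mean here takes only a handful of distinct values, its sampling distribution is discrete and is described by a pmf.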


97/97

Tests of Hypotheses

Basic Concepts in Testing Statistical Hypotheses

The first step in hypothesis testing is to identify and state the statistical hypotheses to be tested.

A statistical hypothesis is a conjecture concerning one or more populations whose veracity can be established using sample data. The Null Hypothesis, denoted as Ho, is a statistical hypothesis which the researcher doubts to be true. The Alternative Hypothesis, denoted as Ha, is the operational statement of the theory that the researcher believes to be true and wishes to prove; it is a contradiction of the null hypothesis.
