SO5041: Quantitative Research Methods

Brendan Halpin, Sociology

Autumn 2019/0

Unit 0: Course Outline


Course outline

Lecture: CG058, Mon 10h00–12h00
PC Lab: A00600a, Mon 13h00–15h00
Lecturer: Dr Brendan Halpin

[email protected] 3147
Office hours: Tuesday 14h00–17h00, F1-007 (or e-mail for appointment)


Course focus:

The role of empirical reasoning in sociology, especially using quantitative data
Quantitative social science data collection, especially the survey
Handling quantitative data: coding it onto a computer, organising it, presenting it
Statistical analysis: making claims about the world using quantitative data – sampling and inference


Practical focus:

Using software to analyse data and prepare findings
    Stata: statistical software package
    Microsoft Excel: spreadsheet
Carrying out questionnaire-based research
Becoming a critical consumer of quantitative research


Planned outline

Introduction to quantitative method – using number to represent information, simple descriptive statistics and presentations. Using Stata to enter data and report it
Samples, surveys and probability – theory of how a sample can be used to describe a population, elementary probability theory, basic questionnaire design and survey implementation. Manipulating data in Stata, more presentation.
Statistical inference – the practice of using a sample to describe a population. Statistically informed use of Stata: testing difference of means, analysing association in tables.
Linear regression and correlation
Regression analysis with multiple explanatory variables?
In parallel, reading and discussion of a small number of quantitative research reports


Assessment

Assessment: three assignments (due at the ends of weeks 5, 9 and 13) totalling 60%
Final exam: 40%
Repeat assessment: all assignments plus repeat exam, August 2020


Detailed lecture sequence

Week 1 General introduction

Week 2 (1) Introducing Quantitative Social Research, (2) Surveys, Questionnaires and Sampling

Week 3 (3) Numbers as Information – Data, (4) Bivariate analyses

Week 4 (4) Bivariate analyses continued, (5) More on types of variables

Week 5 (6) Sampling, (7) Distributions

Week 6 (8) More on Distributions, (9) Sampling Distributions and the Central Limit Theorem

Week 7 (10) Sampling and confidence intervals, (11) Two new distributions: t and χ²

Week 8 Bank Holiday

Week 9 (12) Questionnaire Design

Week 10 (13) Hypothesis testing, (14) More t-tests

Week 11 (15) Correlation, (16) Regression

Week 12 (16) Regression continued, (17) Review


Lab sequence (subject to change)

Week 1 Logging on, running Stata, general intro (short lab)

Week 2 Simple data entry, plus some univariate analysis

Week 3 Bivariate analysis

Week 4 Editing data in Stata, missing values

Week 5 Using the Normal Distribution (by hand, spreadsheet)

Week 6 Confidence intervals for means and proportions (by hand, Stata)

Week 7 Understanding the standard deviation (spreadsheet exercises)

Week 8 Bank Holiday

Week 9 Chi-squared test (spreadsheet and Stata)

Week 10 Hypothesis test on a mean (spreadsheet and Stata)

Week 11 More hypothesis tests; correlation

Week 12 Regression analysis using Stata


Note on assignments

Cooperation between students is encouraged but assignments must be the student’s own work
Please refer to the Dept Plagiarism Policy at http://www.ul.ie/sociology/ (under “Student Resources”)
Please use the Dept Assignment Declaration form with all assignments: http://www.ul.ie/sociology/ (under “Student Resources”)
Note the Dept’s policy on deadlines: http://www.ul.ie/sociology/ (under “Student Resources”)


Texts

Main text: Agresti and Finlay, Statistical Methods for the Social Sciences – good introduction to formal statistical methods, very clear and accessible (will use more extensively in second semester course)
For Stata:
    Robson and Pevalin, The Stata Survival Manual
    Kohler and Kreuter, Data Analysis using Stata
    Acock, A Gentle Introduction to Stata
Dip into Alan Bryman, Social Research Methods
Also David de Vaus, Surveys in Social Research: good on survey methodology
Other occasional readings


Some resources on the web (local)

Examples and data to download for lab sessions
All accessible from http://teaching.sociology.ul.ie/so5041/

Course SULIS site (especially for assignments, see also course outline)

Unit 1: Introducing Quantitative Social Research


Qual versus Quant?

Social research is often divided into qualitative and quantitative:
    Is this a real division?
    What exactly is quantitative social research?
In some respects the distinction is clear:
    Quantitative research is concerned with numbers and statistical analysis, typically of large-scale surveys
    Qualitative research is less obviously concerned with number, and typically is smaller scale and more in-depth
‘Qualitative’ actually covers a diverse range of research and method
But sometimes this division is over-stated, e.g., “qualitative sociology” versus “quantitative sociology”
It is only at the level of method that the division is clear


Positivism–Interpretivism?

Some commentators see major philosophical divisions:
    quantitative ≡ positivist
    qualitative ≡ interpretivist, anti-positivist
While there are some parallels, this is misleading: quantitative research is not necessarily positivist, and is not incompatible with interpretivist frameworks
Positivism is a philosophy of science that holds that only that which can be measured exists
However, it is often used to mean a naïvely scientific approach to social research, or as an inherently negative way of referring to the quantitative method


Positivism–Interpretivism?

In the past positivists have argued that social science should use exactly the same methods as the natural sciences,
    suggesting that social phenomena should be studied via measurement and observation from the outside,
    strongly downplaying the role of understanding or of rational social action


Critique of positivism

Positivism has many critics, in sociology often from an interpretivist perspective: the interpretation of the meaning of action and communication is paramount for this sort of social research
Some attacks on positivism have been extended to attacks on quantitative methods, but it is mistaken to treat them as the same
Many interpretivists criticise the quantitative method for not being able to deal with the complexity of meaning and context in social life – this is not the same as attacking it for being positivist


Quantitative approaches and rationality

A great deal of quantitative research works in terms of knowledgeable rational actors
    Weber (key theorist for interpretivists) himself carried out a number of quantitative projects
    Much neo-Weberian sociology uses quantitative method, e.g., the “rational action theory” school
Moreover, to justify quantitative sociology it is sufficient that you can say some useful things via “measurement”, and conversely, all naturalistic research is in a sense dealing with measurement and observation
Increasing use of “mixed methods”: both qual and quant in the same project


Limitations

Though quantitative research is not necessarily positivist, it has clear limitations:
    Rigid, almost industrial method – less opportunity for a dialogue between theory and data collection during the research process
    Restrictive: you can only deal with the data you have
    Weak on rare or hard-to-trace populations, or on rare events
    Institutional context: the large investment required means substantial pressure to follow the priorities of funders


Powerful

But it is very powerful for certain types of question:
    whether certain processes or phenomena are present in a given population
    where the relative size of competing effects is important
    indeed, wherever it is necessary to generalise to a large population
Probably true to say that the best research is methodologically ecumenical

An Anatomy of Quantitative Research

“Scientific” model?

Though not necessarily positivist, quantitative research has strong affinities with a “scientific” model of the research process:
    the Theory ⇒ Hypothesis ⇒ Test cycle
    Falsification in the Popperian sense
    an interest in causal relations


The experiment

The ideal type of this method is the experiment, in which, at least conceptually:
    subjects are allocated randomly to two groups,
    one control group
    one “treatment” group exposed to the variable of interest
    with the strong inference that differences in outcome between the two groups are caused by the treatment
Experiments are, of course, almost always impossible in social research (and much other research)


Observational data

The alternative is the observational model:
    collect data “in the field”, e.g., using surveys or “observation”
    use multi-variable statistical techniques to search for causal relations
Necessarily weaker than the experimental model: much harder to be sure you are not missing another reason for a given effect
Generally harder to reason about causality with observational data: the evidence you have is simply association or correlation


Ambiguity

That is, your survey may show that unemployed people have poorer mental health than other categories (an “association” between employment status and mental health), but cannot tell you why:
    unemployment might simply be bad for you
    people with mental health problems may be more prone to lose jobs
    or a third factor may affect both (e.g., geography: in one region there may be higher unemployment and higher mental ill-health for unrelated reasons)


Taking time into account

Longitudinal observational designs help:
    if respondents are observed at two time points, and we systematically see people moving from employment to unemployment and equanimity to depression (and vice versa), we can exclude the “third factor” argument
    if we observe people more frequently, we may be able to observe the onset of unemployment before the onset of depression at the individual level (or vice versa) and argue with more confidence for a causal relationship

Unit 2: Surveys, Questionnaires and Sampling


Surveys and Survey Research

Social survey research is very widespread:
    political opinion polls and market research
    EU-wide Labour Force Survey & CSO’s Quarterly National Household Survey (LFS plus)
    EU Eurobarometer
    European Social Survey
    International Social Survey Programme
    Household Budget Survey
    UK Family Expenditure Survey, General Household Survey
    Growing up in Ireland, TILDA
    Slán, Irish Study of Sexual Health and Relationships
    surveys of business opinion, of inventories etc.
    many emanating from ESRI or government


Surveys: representativity

The key principle of survey research is representativity: because the sample is random, summaries of the sample’s characteristics can be imputed to the relevant population
Sometimes we end up with too few cases of a subgroup to analyse – e.g., ethnic minorities; over-sampling or specially targeted surveys may help


Longitudinal surveys

Longitudinal surveys are a special case:
    Panel surveys interview the same sample at regular intervals (e.g., European Community Household Panel, US Panel Study of Income Dynamics, German Socio-Economic Panel, British ‘Understanding Society’ Panel Study)
    Retrospective studies ask respondents to report complete life histories retrospectively (Irish Mobility Study, UK Family and Working Lives Survey, German Life History Study, etc.)
    Cohort studies take a group of subjects and follow them forward (e.g., the Growing Up in Ireland study, The Irish Longitudinal Study on Ageing)
Taking time into account makes these in many ways much richer data sources


Questionnaire design

The questionnaire is the linchpin of the survey
Must elicit the right information with minimum of ambiguity or suggestion, and minimum inconvenience to the respondent
Question design is a black art, since small changes of phrasing may cause different results
Extensive reliance on standardised questions, or standardised forms of questions (e.g., the typical five point answer scale: strongly agree, agree, neutral, disagree, strongly disagree)
Standard schedules exist for certain purposes, e.g., the General Health Questionnaire
Very important to minimise “open” questions: much cheaper to pre-code answers (but allow an “other, specify” answer)
Very important to test questionnaires in a pilot survey, to trap ambiguities and other problems, and to help pre-code questions

Inference, sampling and statistics

The basis of survey research is inference: the projection of the characteristics of the sample onto the population
This requires a random (or quasi-random) sample: each member of the population of interest has an equal chance of being selected
Consider a simple summary: the mean

$$\bar{x} = \frac{\sum x_i}{n}$$


Calculations on samples

We could calculate the mean for the entire population but that would be expensive
We can calculate the mean for a random sample: much cheaper and the answer approximates the true answer in a probabilistic way
That is, if we were to draw a large number of samples and calculate all the sample means, these sample means would be distributed about the true (population) value with an approximately normal distribution
In only drawing one sample, we infer that the true mean lies probabilistically around the sample mean, with a large sample giving a tighter distribution than a small one
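The repeated-sampling idea can be tried out with a small simulation. This is only a sketch in Python (the course itself uses Stata and Excel), and the "income" population is invented for illustration: it draws many samples of two different sizes and compares how tightly the sample means cluster around the population mean.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

# Hypothetical skewed "income" population (illustrative, not course data).
rng = random.Random(42)
population = [rng.expovariate(1 / 1500) for _ in range(20_000)]
true_mean = statistics.mean(population)

small = sample_means(population, sample_size=25, n_samples=1000)
large = sample_means(population, sample_size=400, n_samples=1000)

print(f"population mean:              {true_mean:8.1f}")
print(f"mean of sample means, n=25:   {statistics.mean(small):8.1f}")
print(f"mean of sample means, n=400:  {statistics.mean(large):8.1f}")
print(f"spread of sample means, n=25: {statistics.stdev(small):8.1f}")
print(f"spread of sample means, n=400:{statistics.stdev(large):8.1f}")
```

Both sets of sample means should centre near the population mean, with the larger samples giving the narrower spread.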


Error, but maybe not too much

We can also calculate from the data how wide this probability distribution is: this is the basis for “significance”, a measure of how much the sample summary can be trusted
It is usually the case that a sample a tiny fraction the size of the population will provide good estimates
The same reasoning applies to means, frequency distributions, correlation coefficients, cross-tabulations, regression coefficients etc.


Multi-variable techniques

Multi-variable techniques are the key to causal analysis: take account of the effects of many variables simultaneously to estimate the net effect of each
Regression analysis is the most commonly used multi-variable technique, and is similar in principle to most multi-variable techniques. It is used with a dependent variable that is continuous (i.e., like age, time, income etc., rather than religion, sex, nationality, which are categorical)

Unit 3: Numbers as Information – Data

Processing information numerically


Number as information

In the Lab we will begin by coding data in numerical form, entering it on the computer and using Stata to make simple descriptive summaries of it
This activity is an important stage of the quantitative research process: turning information into numbers and numbers into information


Steps

There are several important steps in this process:
    Determine a clear set of information items to collect (i.e., what questions you want to ask)
    Determine a simple way of representing this information:
        Represent it as a single “quantitative” number, e.g., Euros per annum for income, or
        Determine a fixed set of categories grouping every possible answer, and attach a number to each category
    Collect it (easier said than done) and enter it on a computer
But for the analysis to make sense we need to bring in information about what the numbers mean: variable labels and value labels restore some part of the meaningful information hidden by coding as numbers

Univariate summaries

Descriptive summaries

When we enter data on a computer we can easily present descriptive univariate summaries (i.e., summarising one variable at a time)
For categorical or grouped variables (i.e., where there is a “small” number of different values) we can use a frequency table: how many of each sort are there? what proportion of each sort are there?
For variables with a “quantitative” interpretation (i.e., where the number measures something like age, income, duration, distance, or where it is a count like number of children, number of cigarettes per day) we can explore the mean or average, or perhaps the median


Frequency Table

. tab rjbstat

current economic |
        activity |      Freq.     Percent        Cum.
-----------------+-----------------------------------
   self-employed |        929        6.82        6.82
        employed |      6,749       49.58       56.41
      unemployed |        468        3.44       59.84
         retired |      3,034       22.29       82.13
 maternity leave |         75        0.55       82.68
     family care |        786        5.77       88.46
ft studt, school |        878        6.45       94.91
lt sick, disabld |        608        4.47       99.38
 gvt trng scheme |         22        0.16       99.54
           other |         63        0.46      100.00
-----------------+-----------------------------------
           Total |     13,612      100.00
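To make the Percent and Cum. columns concrete, here is a minimal sketch in Python (not Stata, and not how Stata works internally) that rebuilds them from the frequencies printed in the table. The function name `frequency_table` is just an illustrative choice.

```python
# Counts copied from the `tab rjbstat` output on the slide.
counts = {
    "self-employed": 929,
    "employed": 6749,
    "unemployed": 468,
    "retired": 3034,
    "maternity leave": 75,
    "family care": 786,
    "ft studt, school": 878,
    "lt sick, disabld": 608,
    "gvt trng scheme": 22,
    "other": 63,
}

def frequency_table(counts):
    """Return rows of (category, freq, percent, cumulative percent)."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, freq in counts.items():
        running += freq
        # Cumulative percent uses the running count, not rounded percents.
        rows.append((category, freq, 100 * freq / total, 100 * running / total))
    return rows

table = frequency_table(counts)
for category, freq, pct, cum in table:
    print(f"{category:<18}{freq:>7,}{pct:>9.2f}{cum:>9.2f}")
print(f"{'Total':<18}{sum(counts.values()):>7,}{100:>9.2f}")
```

Each percent is the category count divided by the total of 13,612; the cumulative column adds up the counts so far before dividing, which is why it can differ in the last digit from a sum of rounded percents.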


Summarising a “quantitative” variable

. su rfimn

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       rfimn |     13,612    1454.687    1459.462          0   56916.67


Codebook

. codebook rfimn

rfimn                                          total income: last month

                  type:  numeric (double)
                 label:  rfimn, but 9658 nonmissing values are not labeled
                 range:  [0,56916.668]                units:  1.000e-08
         unique values:  9,658                    missing .:  0/13,612
              examples:  470.71835
                         925.07501
                         1401.9929
                         2175.7588


Mean, Median

The mean is defined as the sum total of the variable, divided by the number of cases:

$$\bar{x} = \frac{\sum x_i}{n}$$

The median is defined as the value such that 50% of cases are lower and 50% of cases are higher


Calculating the median

To calculate the median “by hand”:
    sort the values in order
    if there is an odd number of cases (e.g., 101), find the middle one (e.g., 51st) – its value is the median
    if there is an even number of cases (e.g., 100), find the middle pair (e.g., 50th and 51st) – the median is half way between their values
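The by-hand steps translate directly into a few lines of code. The following Python sketch is illustrative only; the function name `median_by_hand` and the example values are assumptions, not part of the course materials.

```python
def median_by_hand(values):
    """Median following the by-hand steps: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of cases: the single middle value.
        return ordered[middle]
    # Even number of cases: half way between the middle pair.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_hand([11, 9, 12, 11, 10]))  # odd number of cases
print(median_by_hand([3, 1, 4, 2]))         # even number of cases
```

With 101 cases numbered 1 to 101 the function picks the 51st value; with 100 cases it averages the 50th and 51st.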


Median via Stata

. centile rfimn

                                                       -- Binom. Interp. --
    Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
       rfimn |    13,612         50    1141.668        1119.325    1166.667

Univariate graphs

Graphs

We can also make graphical summaries
For categorical variables the bar-chart is very good: this is like a frequency table where the height of the bars is proportional to the number/proportion in that category
We can also use pie-charts: these are circles with segments (pie-slices) whose angle is proportional to the size of the category (there are some arguments that bar-charts are better)
For quantitative variables we can use histograms: these are very like bar-charts, but break the “continuous” variable into groups. The bars in a histogram touch, to show that the variable is really continuous. In a histogram, the area under the “curve” (or stepped line) for a given range is proportional to the numbers in that range. It is sometimes interesting to vary the group size (the “bin size”) to get more or less detail


Bar chart

graph hbar, over(rjbstat)


Zero is important


Beware: People lie with barcharts


Pie chart


Pie is bad for you

Source: http://junkcharts.typepad.com/junk_charts/2009/08/community-outreach-piemaking.html


Histograms

hist rfimn if rfimn<=16000


Different bin widths

hist rfimn if rfimn<=16000, bin(10)
hist rfimn if rfimn<=16000, bin(100)


Some interesting reading on graphics

William S Cleveland, The Elements of Graphing Data. Rev. ed. Murray Hill, N.J.: AT&T Bell Laboratories, 1994
Edward Tufte, The Visual Display of Quantitative Information. Cheshire, Conn.: Graphics Press, 1983
Kieran Healy, Data Visualization: A Practical Introduction, 2018. Focused on R, but Chapter 1 is a general discussion. Online at https://socviz.co/

Unit 4: Bivariate analyses


Types of variables

So far I have been using several terms for types of variable:
    Categorical
    Grouped
    Continuous
    Quantitative
A more formal classification, which corresponds with the sorts of analysis possible, goes:
    Nominal
    Ordinal
    Interval
    Ratio


NOIR – categorical

Nominal variables have categories where each category is “just a name”, i.e., nominal. Examples: religion, region of birth, political allegiance
With nominal variables we can’t do much more than present frequencies
Ordinal variables have categories where each category is something different, but where it is possible to put the categories in a meaningful high–low order. Examples include highest maths qualification, exam letter grades, attitudes (agree strongly, agree, neutral, disagree, disagree strongly) and so on. With ordinal variables, frequencies are meaningful, and the “cumulative percent” column that Stata provides is also meaningful. Additionally, with ordinal variables we can calculate the median


NOIR – quantitative

Quantitative variables can be interval or ratio
Interval variables are variables like temperature, where the difference between, say, 35° and 40° is the same as between 50° and 55°. That is, the meaning of an interval is the same no matter where it is. However, with temperature (centigrade or Fahrenheit) it is not true that 40° is twice as hot as 20°, because 0° is not a true starting point. With interval variables it makes sense to calculate the mean. That is, if we have five days with temperatures of 11°, 9°, 12°, 11° and 10°, it makes sense to say the average is $\frac{11+9+12+11+10}{5} = 10.6$


NOIR – ratio variables

Ratio variables are like interval variables, and are probably more commonly encountered. These are quantitative variables where the zero makes sense. Distance, duration, money and age are all ratio variables, because 0 kilometres, 0 seconds, €0 and 0 years are all things that make sense. With ratio variables we can also use the mean, but unlike interval variables we can also work with proportions (twice as old, twice as rich, 10% farther and so on)

Bivariate Summaries

Recap: Univariate analysis

Frequencies: tab foreign

Pie chart: graph pie, over(foreign)

Bar chart: graph bar, over(foreign)

Medians, etc: centile mpg, centile(25 50 75)

Mean, standard deviation: su mpg

Histogram: histogram mpg


Bivariate Analysis

So far we have dealt with describing one variable at a time: important but limited
Bi-variate (two-variable) summaries allow us to look at relationships between variables as well
Where both variables are categorical (nominal, ordinal or grouped) we can use two-way tables, or cross-tabulations


Cross-tabulation

. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta

. tab empstat sex

     usual employment |  school leavers sex
            situation |         1          2 |     Total
----------------------+----------------------+----------
  working for payment |       388        380 |       768
           unemployed |        67         46 |       113
  looking for 1st job |       170        151 |       321
              student |       471        490 |       961
engaged in home dutie |         0         19 |        19
permanent disability/ |         4          0 |         4
                other |         4          7 |        11
----------------------+----------------------+----------
                Total |     1,104      1,093 |     2,197


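Column Percentages

. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta
. tab empstat sex, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

     usual employment |  school leavers sex
            situation |         1          2 |     Total
----------------------+----------------------+----------
  working for payment |       388        380 |       768
                      |     35.14      34.77 |     34.96
----------------------+----------------------+----------
           unemployed |        67         46 |       113
                      |      6.07       4.21 |      5.14
----------------------+----------------------+----------
  looking for 1st job |       170        151 |       321
                      |     15.40      13.82 |     14.61
----------------------+----------------------+----------
              student |       471        490 |       961
                      |     42.66      44.83 |     43.74
----------------------+----------------------+----------
engaged in home dutie |         0         19 |        19
                      |      0.00       1.74 |      0.86
----------------------+----------------------+----------
permanent disability/ |         4          0 |         4
                      |      0.36       0.00 |      0.18
----------------------+----------------------+----------
                other |         4          7 |        11
                      |      0.36       0.64 |      0.50
----------------------+----------------------+----------
                Total |     1,104      1,093 |     2,197
                      |    100.00     100.00 |    100.00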
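The difference between column and row percentages is only a matter of which total each cell is divided by. A short Python sketch (illustrative, not Stata's implementation) using the counts from the cross-tabulation; the row labels are written out in full here, whereas Stata truncates two of them.

```python
# Counts copied from the `tab empstat sex` output (columns: sex = 1, sex = 2).
crosstab = {
    "working for payment": (388, 380),
    "unemployed": (67, 46),
    "looking for 1st job": (170, 151),
    "student": (471, 490),
    "engaged in home duties": (0, 19),
    "permanent disability": (4, 0),
    "other": (4, 7),
}

def column_percentages(table):
    """Divide each cell by its column total, so each column sums to 100."""
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    return {label: tuple(100 * row[j] / col_totals[j] for j in range(n_cols))
            for label, row in table.items()}

def row_percentages(table):
    """Divide each cell by its row total, so each row sums to 100."""
    return {label: tuple(100 * cell / sum(row) for cell in row)
            for label, row in table.items()}

col_pct = column_percentages(crosstab)
row_pct = row_percentages(crosstab)
for label in crosstab:
    cols = "  ".join(f"{p:6.2f}" for p in col_pct[label])
    rows = "  ".join(f"{p:6.2f}" for p in row_pct[label])
    print(f"{label:<24} col%: {cols}   row%: {rows}")
```

Column percentages answer "what share of each sex is in this employment situation?", while row percentages answer "what share of people in this situation is of each sex?".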


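Too many percents!

. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta
. tab empstat sex, row col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+

     usual employment |  school leavers sex
            situation |         1          2 |     Total
----------------------+----------------------+----------
  working for payment |       388        380 |       768
                      |     50.52      49.48 |    100.00
                      |     35.14      34.77 |     34.96
----------------------+----------------------+----------
           unemployed |        67         46 |       113
                      |     59.29      40.71 |    100.00
                      |      6.07       4.21 |      5.14
----------------------+----------------------+----------
  looking for 1st job |       170        151 |       321
                      |     52.96      47.04 |    100.00
                      |     15.40      13.82 |     14.61
----------------------+----------------------+----------
              student |       471        490 |       961
                      |     49.01      50.99 |    100.00
                      |     42.66      44.83 |     43.74
----------------------+----------------------+----------
engaged in home dutie |         0         19 |        19
                      |      0.00     100.00 |    100.00
                      |      0.00       1.74 |      0.86
----------------------+----------------------+----------
permanent disability/ |         4          0 |         4
                      |    100.00       0.00 |    100.00
                      |      0.36       0.00 |      0.18
----------------------+----------------------+----------
                other |         4          7 |        11
                      |     36.36      63.64 |    100.00
                      |      0.36       0.64 |      0.50
----------------------+----------------------+----------
                Total |     1,104      1,093 |     2,197
                      |     50.25      49.75 |    100.00
                      |    100.00     100.00 |    100.00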


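Ordinal variable

. tab mopfama mopfamb, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

     pre-school child |
    suffers if mother |      family suffers if mother works full-time
                works |  strongly      agree  neithr ag   disagree   strongly |     Total
----------------------+-------------------------------------------------------+----------
       strongly agree |       572        328         79         32         21 |     1,032
                      |     55.43      31.78       7.66       3.10       2.03 |    100.00
----------------------+-------------------------------------------------------+----------
                agree |       267      2,507        876        354         37 |     4,041
                      |      6.61      62.04      21.68       8.76       0.92 |    100.00
----------------------+-------------------------------------------------------+----------
neithr agree, disagre |        55        813      3,070      1,010         95 |     5,043
                      |      1.09      16.12      60.88      20.03       1.88 |    100.00
----------------------+-------------------------------------------------------+----------
             disagree |        32        295        437      2,777        344 |     3,885
                      |      0.82       7.59      11.25      71.48       8.85 |    100.00
----------------------+-------------------------------------------------------+----------
    strongly disagree |         6         23         40        164        680 |       913
                      |      0.66       2.52       4.38      17.96      74.48 |    100.00
----------------------+-------------------------------------------------------+----------
                Total |       932      3,966      4,502      4,337      1,177 |    14,914
                      |      6.25      26.59      30.19      29.08       7.89 |    100.00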

Bivariate graphs

Graphical ways of presenting the same information as a cross-tabulation include clustered and stacked bar charts:
    Clusters are good for looking at distributions within groups
    Stacked bars are good where it is important to emphasise the sizes of the groups


Clustered bar chart

[Figure: clustered horizontal bar chart of number of respondents (axis 0–500) by employment situation (working for payment, unemployed, looking for 1st job, student, other), with separate bars for Male and Female]


Stacked bar chart

[Figure: stacked horizontal bar chart of number of respondents (axis 0–1,000) by employment situation (working for payment, unemployed, looking for 1st job, student, other), with Male and Female stacked in each bar]


Compare means – continuous within categorical

Where one of the variables is categorical and the other is continuous (interval or ratio) we can “compare means”. That is, instead of calculating mean income, we can calculate mean income for different groups


Wage by occupational group

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. tab occupation, su(wage)

            |      Summary of hourly wage
 occupation |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  Professio |   10.723624   6.3510736         317
  Managers/ |   10.899784   7.5215875         264
      Sales |   7.1544893   5.0427568         726
  Clerical/ |   8.5166856   8.5653995         102
  Craftsmen |    7.152988   3.7629626          53
  Operative |   5.6538061   3.3861932         246
  Transport |   3.2004937   1.3209668          28
   Laborers |   4.9058895   3.6083499         286
    Farmers |   8.0515299           0           1
  Farm labo |   3.0837354   .76638443           9
    Service |   5.9887432   2.1399333          16
  Household |   6.3888881   .31313052           2
      Other |   8.8362936    4.128914         187
------------+------------------------------------
      Total |   7.7779283   5.7613599       2,237


Another use of bar charts

[Figure: horizontal bar chart of mean of wage (axis 0–10) by occupation, from Professional/technical to Other]


Boxplot

[Figure: horizontal boxplots of hourly wage (axis 0–40) by occupation]


Boxplot

The boxplot summarises the distribution of a variable:
    50% of the distribution is within the box
    The bar in the middle is the median
    The “whiskers” extend to the minimum and maximum, excluding “outliers”
    Outliers are cases more than 1.5 IQR above or below the corresponding quartile (these are often marked separately)
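The quantities a boxplot draws can be computed directly. The sketch below is in Python with invented wage values; note that software packages use slightly different conventions for computing quartiles, so the exact figures from Stata may differ a little from those given by the `method="inclusive"` option assumed here.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, 1.5-IQR fences, whisker ends and outliers for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme cases that are not outliers.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical hourly wages (illustrative values only).
wages = [3.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.5, 40.0]
for name, value in boxplot_summary(wages).items():
    print(f"{name:>12}: {value}")
```

In this invented example the wage of 40.0 lies well above the upper fence, so it would be plotted as a separate point rather than stretching the whisker.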


Three-way analysis

[Figure: horizontal bar chart of mean of wage (axis 0–10) by occupation, with separate bars for nonunion and union]


Scatterplot

Where both variables are continuous, a very useful device is the scatterplot:

[Figure: scatterplot of total income: last month (axis 0–8000) against no. of hours normally worked per week (axis 0–100)]

[Figure: scatterplot of net pay last week (axis 0–400) against gross pay last week (axis 0–500)]


Background Reading

Agresti and Finlay, Statistical Methods for the Social Sciences, chapter 1 (background) and chapter 3 (descriptive statistics)
To follow: A&F, chapter 2, Sampling and Measurement

Unit 5: More on types of variables


Spread in quantitative variables

When we have quantitative variables like age, income, we use measures of central tendency like mean and median to describe the “middle”
However, the dispersion about the middle is also important: completely different distributions can have the same mean:

           A      B      C
           1    5.5    1.0
           2    5.5    4.0
           3    5.5    4.5
           4    5.5    5.2
           5    5.5    5.3
           6    5.5    5.6
           7    5.5    5.9
           8    5.5    6.0
           9    5.5    7.5
          10    5.5   10.0
        ------------------
Mean     5.5    5.5    5.5


Spread in quantitative variables

[Figure: “Three distributions with same mean”: three density plots (panels 1, 2 and 3), each with x-axis 0–10 and Density on the y-axis]


Summarising spread

As well as summarising the middle, we often want to summarise the spread:
    Range
    Inter-quartile range
    Standard Deviation
The range is important, but depends too much on just the two extreme cases
The IQR is much more stable
The standard deviation has useful statistical properties


The “Standard Deviation”

The standard deviation is calculated from the “deviations” from the mean, $\bar{X}$:

$$\text{Deviation} = X_i - \bar{X}$$

The deviations add to zero so we can’t just add them up: instead square them and then add them up:

$$\sum (X_i - \bar{X})^2$$

To get a sort of average, we divide by sample size minus 1, and take the square root:

$$\sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
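As a worked illustration, the calculation can be followed step by step in code. This is a Python sketch rather than a Stata session, applied to the three columns A, B and C from the earlier “Spread in quantitative variables” table, which all have mean 5.5; the function name is an illustrative choice.

```python
import math
import statistics

def standard_deviation(values):
    """Sample standard deviation, following the steps on the slide."""
    n = len(values)
    mean = sum(values) / n
    # Squared deviations from the mean (the raw deviations would sum to zero).
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Divide by n - 1 to get "a sort of average", then take the square root.
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Columns A, B and C from the table of three distributions with mean 5.5.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [5.5] * 10
c = [1.0, 4.0, 4.5, 5.2, 5.3, 5.6, 5.9, 6.0, 7.5, 10.0]

for name, column in (("A", a), ("B", b), ("C", c)):
    print(f"{name}: mean = {sum(column) / len(column):.1f}, "
          f"sd = {standard_deviation(column):.2f}")
```

The three columns share a mean but not a spread: B has no spread at all, and A is more spread out than C, where most values sit close to the middle.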


Standard Deviation: good measure of spread

The standard deviation indicates how spread out the data is: the more spread out, the bigger the StDev
It’s a good measure because it depends on every single case, not just the extreme pair (range) or the quartiles
It has useful statistical properties which help when we come to statistical inference

More on types of variables

We have already divided variables into nominal/ordinal/interval/ratio
We can also differentiate between “qualitative” and “quantitative”:
    Nominal variables are clearly qualitative: observations with different values have different qualities
    Interval/ratio variables directly represent quantities: amount of money, number of children, distance


What about ordinal variables?

Where do ordinal variables (e.g., level of education) fit in?
    In between?
    Often treated as qualitative – categories are clearly “different”
    Sometimes treated as quantitative: e.g., letter-grades are given scores for calculating QCA; “Likert” scale answers (strongly agree to strongly disagree) are given scores of -2, -1, 0, 1, 2
Applying scores to an ordinal variable implies there is an underlying interval/ratio variable which we do not measure
In the QCA example this is true: the individual exam marks are on a ratio scale
In the Likert-scale example, we may feel there is a continuum of agreement–disagreement which is approximately measured by the five options


Sensitivity

There, however, we may feel that the steps between neutral and (dis)agree are bigger than between (dis)agree and strongly (dis)agree so we might want the scores to go -3, -2, 0, 2, 3 (or -3, -1, 0, 1, 3)
When we base an analysis on applied scores like these, it is good to do a “sensitivity analysis”: compare the results with different scoring systems to see how sensitive your analysis is to the scoring method


Summary of 5-point ordinal scale

Question                                                  VP     P  Neutral     G    VG    N  Score
Overall satisfaction with Department of Sociology        1.4   1.7     23.2  52.6  21.1  289  0.781
National reputation of University of Limerick            0.7   2.5     16.5  36.6  43.7  279  0.840
Overall quality of academic programme within UL          0.0   5.1     13.6  54.2  27.1  295  0.807
Overall quality of services provided by UL               1.0   7.7     20.1  47.8  23.4  299  0.770
Overall quality of the social aspects at UL services     0.7   3.2     23.6  36.6  35.9  284  0.808
Quality of Student Academic Administration office        3.9  12.5     33.9  37.1  12.5  280  0.684
Quality of Fees office                                   2.9   7.5     47.3  33.5   8.8  239  0.675
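A sensitivity analysis of this kind is easy to sketch in code. The Python example below uses the answer percentages from four rows of the table. The slides do not say how the Score column was calculated; scoring the five answers 1 to 5 and dividing by 5 appears to reproduce it to within rounding, but that is an inference, not something stated in the source. The item labels are shortened and the scoring-system names are illustrative.

```python
# Answer percentages (VP, P, Neutral, G, VG) copied from four table rows.
items = {
    "Department of Sociology": (1.4, 1.7, 23.2, 52.6, 21.1),
    "National reputation of UL": (0.7, 2.5, 16.5, 36.6, 43.7),
    "Student Academic Administration": (3.9, 12.5, 33.9, 37.1, 12.5),
    "Fees office": (2.9, 7.5, 47.3, 33.5, 8.8),
}

# Alternative scoring systems; the first is an inferred guess at the table's.
scoring_systems = {
    "1 to 5, divided by 5": (0.2, 0.4, 0.6, 0.8, 1.0),
    "equal steps": (-2, -1, 0, 1, 2),
    "big step at neutral": (-3, -2, 0, 2, 3),
    "big step at extremes": (-3, -1, 0, 1, 3),
}

def mean_score(percentages, scores):
    """Weighted mean of the scores, using answer percentages as weights."""
    return sum(p * s for p, s in zip(percentages, scores)) / sum(percentages)

def ranking(scores):
    """Item names ordered from highest to lowest mean score."""
    return sorted(items, key=lambda name: mean_score(items[name], scores),
                  reverse=True)

for label, scores in scoring_systems.items():
    print(label)
    for name in ranking(scores):
        print(f"  {name:<34}{mean_score(items[name], scores):7.3f}")
```

For these four items the ordering is the same under every scoring system tried, which suggests that conclusions about their relative standing are not sensitive to the choice of scores.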


Yet another distinction!

A new distinction: discrete versus continuousDiscrete variables have a finite number of possible values

Nominal and ordinal variables are by definition discreteCount variables are discrete: 0, 1, 2, 3 . . . – you can’t actually have 2.4 childrenGrouped interval variables are discrete

Continuous variables can have an infinite number of values in principle: 2, 2.4,2.43, 2.435, 2.4358 and so on.Ungrouped interval/ratio variables are continuous in principle: an infinite numberof values are possibleIn practice, we treat income as continuous though it can’t vary in amounts lessthan 1 cent (and age, though we often measure it in integer years, or to thenearest month, etc.)


Unit 6: Sampling

Sample not Census

Most quantitative research proceeds by using samples: rather than ask the entire electorate every month or so who they would vote for, ask a random sample
If these individuals are chosen at random, their responses will approximate those of the whole electorate
Random here does not mean completely without control, rather that each individual in the reference population has an equal chance of selection

Simple Random Sampling

The basic sampling strategy is the simple random sample
  every subject in the population has the same chance of being selected
  every possible sample of that size has the same chance of being selected

How to select a random sample:
  Acquire a sampling frame: a list of all individuals in the reference population
  Number the individuals from 1 to N, the size of the population
  Draw n (sample size) numbers from a random number table or a computer, in the range 1 to N
  Interview the corresponding individuals
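
The selection steps above can be sketched in Python. This is a minimal illustration, not part of the course software (which is Stata and Excel); the sampling frame of student IDs and the function name `simple_random_sample` are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members of the sampling frame, each equally likely."""
    rng = random.Random(seed)
    # Number the individuals 1..N and draw n distinct numbers in that range.
    numbers = rng.sample(range(1, len(frame) + 1), n)
    # Look up the individuals corresponding to the drawn numbers.
    return [frame[i - 1] for i in numbers]

# Hypothetical sampling frame of N = 5000 student IDs.
frame = [f"student-{i:04d}" for i in range(1, 5001)]
sample = simple_random_sample(frame, n=10, seed=1)
print(sample)
```

Because `random.sample` draws without replacement, nobody can be selected twice, and every possible set of 10 individuals is equally likely.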

Random numbers

In the past random number tables were widely used:

Line/Col   (1)    (2)    (3)    (4)
01        96252  48687  59641  99460
02        71334  10218  07459  20339
03        35932  10229  83514  76461
04        33568  36397  92080  98430
05        31535  29990  89596  77529
06        16483  30849  18676  63225
07        26374  60736  14522  66096
08        12040  90130  91860  08280

Now computers do the job very conveniently: for instance in Excel typing =rand() in a cell gives you a random number between 0 and 1.

Probability sampling

Simple random samples are an example of probability sampling
Probability sampling has the enormously important theoretical advantage of representativity
If properly conducted, the sample is guaranteed to be representative, within the limits of sampling error
That is, all sample characteristics approximate those of the population – even characteristics you haven’t thought of
Non-probability sampling does not have these advantages but is sometimes used for other reasons

Non-probability sampling

There are many forms of non-probability sampling:
  volunteer sampling (self-selection)
  “streetcorner interview”
  quota sampling

Volunteer sampling is popular in the media: questionnaires in magazines, phone-in lines (vote for/against TV programme issue)
Sometimes the sample is very large
However, because respondents are self-selecting serious bias can enter

Volunteer sample – dodgy inference!

Agresti/Finlay quote several examples (pp 20–21): e.g., TV phone vote on “should UN stay in US?” had a response of 186,000 with 67% voting for “No”
A simultaneous random sample of 500 had 28% voting no: much smaller but far more reliable
TV phone-in subject to bias:

  Who is watching the program? Time of day, interest, etc.
  Who is motivated to phone? Those with a strong opinion on the subject

Twitter polls are a particularly bad example of volunteer samples ("retweet for a bigger sample!")

“Opportunistic” samples

“Streetcorner interview” sampling simply interviews whoever comes along: random in that sense
However, where and when the interviewing is done will affect who is likely to be interviewed: under-represent workers if during the working day, under-represent country people if done in the city, over-represent shoppers if done in shopping area, etc.
Quota sampling is a form of sampling that tries to avoid these bias problems by using quotas (of age group, sex, employment status, income group, region etc.) – people passing by at “random” are interviewed only if they fit the quota
Not really probability sampling, but often a very effective and efficient way of “faking” a true random sample, especially for well-understood purposes like opinion polling

Samples are representative

The characteristics of a random sample approximate those of the reference population by virtue of randomisation
For instance, mean income in a sample (the “sample statistic”) will be more-or-less close to the mean income for the population from which it is drawn (the “population parameter”)
The sample statistic is a more-or-less good estimator of the population parameter
This is in general true for any statistic: proportion answering yes, distribution of qualifications, average working hours, joint distributions (e.g., crosstabs) etc.

Samples are approximate

But because it comes from a sample, the sample statistic will not be an exact estimate of the population parameter
The sample statistic depends on chance: the exact contents of the sample
Repeating the exercise with a different sample will give a different sample statistic – still approximating the population parameter, of course
In principle we can think of there being a distribution of possible sample statistic values, all falling more or less close to the true value of the population parameter

Sampling error

This difference between the statistic and the true value is called sampling error
If the true proportion of UL students with part-time jobs is 65% and a sample reports 72% we have a sampling error of 72 – 65 = 7 percentage points
How useful a sample is depends on the size of sampling error

  The bigger the sample the smaller the sampling error
  The more variability in the population the larger the sampling error

How do we know how big the sampling error is? We don’t know the true value!
With certain assumptions about the range of the true value and information about the dispersion of the sample we can estimate it – e.g., with a sample of 1,000 sampling error for proportions (%age working etc.) is around ±3%
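
The rule of thumb for a sample of 1,000 can be checked with a short calculation. The sketch below assumes the standard-error formula for a proportion and the 1.96 multiplier for a 95% level, both of which are introduced later in these slides; the function name `margin_of_error` is made up for the example.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p from a simple random sample of size n."""
    return z * sqrt(p * (1 - p) / n)

# p = 0.5 is the worst case: the population is at its most variable.
for n in (100, 400, 1000, 4000):
    print(n, round(100 * margin_of_error(0.5, n), 1), "percentage points")
```

The loop also illustrates the first bullet above: quadrupling the sample size only halves the sampling error.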

Systematic error

Sampling error is due to sampling variability: always present
Other sources of error can arise – systematic error is a problem
Error can arise from sampling problems

  Undercoverage – e.g., homeless people, geographically mobile people
  Bad sampling frame – e.g., out-of-date electoral register

Error can also arise from response problems
  Outright refusal – “non-response”
  Refusal to answer certain questions – “item non-response”
  Bias or deception in answers

Survey design and error

Sampling error is part of the design
Non-sampling error is a problem and can invalidate the conclusions we wish to make – bias the sample away from representativity of the reference population
Undercoverage or non-response introduces the possibility that the sample we get differs systematically from the population (by differing from the part we don’t get)
For instance, if we are interested in financial wellbeing and our survey underrepresents homeless people or people who move very often the people in the sample may be systematically better off than those we fail to trace
If UL students working part-time are too busy to respond to a questionnaire our sample will under-estimate the average hours spent working part-time

More sampling strategies

We have discussed several types of sampling strategy
  Simple random sampling
  Volunteer sampling and “street-corner interviewing”
  Quota sampling

Random sampling is probability sampling; the others are not
Other forms of probability sampling also exist:

  Systematic sampling
  Stratified sampling
  Cluster sampling
  Multistage sampling

Systematic sampling

Systematic sampling is used where the sampling frame has a more-or-less random order
Pick a name at random near the start of the sampling frame (among the first k cases)
Then pick every k-th name after it, and continue until finished
k, the size of the skip, is calculated as N/n, so that you arrive at the full size of the desired sample
This works as a simple random sample, on the assumption that there is no relationship between where you are in the list and the relevant characteristics you have
If there is any order to the sampling frame (like people being grouped together, for instance, or periodicity) this will not yield a random sample
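
As a rough sketch of the procedure, assuming the frame is held as a Python list and N/n is rounded down to a whole skip size (the function name `systematic_sample` and the numbered frame are hypothetical):

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th case from the frame, starting at a random point among the first k."""
    k = len(frame) // n          # size of the skip, N/n rounded down
    rng = random.Random(seed)
    start = rng.randrange(k)     # random starting position among the first k cases
    return frame[start::k][:n]   # every k-th case, trimmed to n cases

frame = list(range(1, 1001))     # hypothetical frame of N = 1000 numbered cases
sample = systematic_sample(frame, n=50, seed=3)
print(sample[:5])
```

Only the starting point is random; everything after it is fixed by the list order, which is why any pattern in the frame carries straight through to the sample.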

Stratified sampling

Stratified sampling works by dividing the population into strata and collecting random samples within the strata
If the groups or strata are defined according to variables important to the study (e.g., age, gender, occupational group) this method yields statistically more efficient samples (i.e., sampling error is less than with a simple random sample)
Where we know the population distribution of age or gender (e.g., from a census) our sample will automatically have the right proportions
This method also allows over-sampling of small groups: ethnic minorities, people on FÁS schemes, the unemployed – this allows comparisons between small groups and others that would not be possible in a simple random sample

Cluster sampling

Cluster sampling divides the population into groups called clusters
These are often geographic areas, or organisationally based
A random sample of clusters is drawn
Within each cluster, simple random samples of individuals are drawn

Cluster sampling – pros and cons

Advantages:
  Cost – much less interviewer travel time
  Can draw better random samples within cluster than from national sampling frame (e.g., identify all households in cluster first, then sample: better than out-of-date list of addresses)

Disadvantage: statistically less efficient – larger sample needed for same accuracy

Multistage sampling

Multistage samples use several of these methods in combination
For instance, using electoral wards as clusters, take a simple random sample of clusters; within each ward, treat streets as clusters and sample again; within each street identify every separate address and do a systematic sample of every nth household

Reading

A&F Chapter 2
Bryman Ch 8 (see also Ch 7)


Unit 7: Distributions

Distributions

We have seen how to display and summarise the distribution of variables:
  Categorical: frequency distribution, percentage distribution, bar and pie charts
  Quantitative (interval/ratio): mean, median, IQR, standard deviation, histogram, box-plot

Distributions have shapes

The shape of the histogram tells us about the distribution of the variable
If a variable is “uniformly” distributed we see a flat distribution between the extremes:

[Figure: histogram (density) of a uniformly distributed variable, “flat”, over the range 0 to 1000]

Heaped distributions

More often we see “heaped distributions” where more of the observations cluster around the centre, like this example (age) from the 1999 school-leavers’ survey:

[Figure: histogram (density) of age, over the range 15 to 40]

Many patterns

There are many patterns we might see:
  Uniform
  Extremes
  Bimodal
  Uni-modal



Polarisation

http://www.fivethirtyeight.com

Uni-modal differences

Symmetric (with different levels of kurtosis)
  platykurtic – flatter
  mesokurtic – average
  leptokurtic – very concentrated around centre

Asymmetric
  Positively skewed (to right)
  Negatively skewed (to left)

Mathematically defined distributions

There are some patterns that are defined “theoretically” or mathematically
Among these the most important is the Normal Distribution:

[Figure: the normal curve, with its mean and standard deviation marked]

This is the famous “bell-shaped curve”

The Normal Distribution

The normal distribution is
  symmetric (no skew)
  mesokurtic (between flatter and pointier)

The mean, mode and median are the same
The farther you go from the mean, the lower the proportion of cases, in each direction symmetrically

Histograms display distributions

We can see what a histogram of a variable drawn from a normally distributed population looks like:

[Figure: three histograms (density) of samples drawn from a normal distribution: 100 cases, 1,000 cases and 10,000 cases]

As the sample gets bigger, the histogram approximates the theoretical distribution better

Normal: mean and σ

What makes the normal distribution useful is that its form is well understood:
It is completely characterised by its mean and its standard deviation

[Figure: two panels of normal curves: “Same mean, different SD” and “Same SD, different mean”]

Online app: http://teaching.sociology.ul.ie:3838/apps/normsd

Normal distribution: well-understood

[Figure: standard normal curve marking the central areas: 68% of the distribution between -1.0 and 1.0, 90% between -1.645 and 1.645, 95% between -1.960 and 1.960, 97.5% between -2.241 and 2.241]

About 68% of the cases in a normal distribution are within 1 SD on either side of its mean

95% are within ± 1.96 standard deviations of the mean

97.5% are within ± 2.24 standard deviations of the mean
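
These percentages can be reproduced with Python's standard library, as a quick check rather than anything the course requires. The sketch uses `statistics.NormalDist` (Python 3.8 or later); the helper name `central_area` is made up for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def central_area(z):
    """Proportion of a normal distribution within z standard deviations either side of the mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

# The four cut-offs shown in the figure.
for z in (1.0, 1.645, 1.96, 2.241):
    print(f"within ±{z} SD: {100 * central_area(z):.1f}%")
```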


Unit 8: More on Distributions

Histograms and distributions

Histograms display distributions
Probability distributions describe them formally
The set of ticket numbers in a raffle has a uniform distribution

  flat histogram
  equal numbers of tickets in all ranges, e.g., 1–100, 101–200, 201–300
  Thus winning ticket is equally likely to fall in any range: number is not related to chance of selection

Heaped histograms

However, if we were to pick a school-leaver at random from the school-leavers’ survey, there is a relationship with date of birth:

  dates nearer June 1974 more likely
  dates much later or much earlier much less likely

. . . a clustered distribution

Shape and standard deviation

The extent of clustering depends on shape and standard deviation
Smaller standard deviation means individual chosen at random more likely to fall near the mean
The Normal Distribution is a kind of “ideal type” of clustered symmetric distribution

Examples of normally distributed variables

IQ scores, GRE scores – designed that way
Time for a fast-food delivery – possibly normal, e.g., mean of 30 mins, standard deviation of 5 mins
Alcohol consumption of non-abstainers, e.g., mean of 20 units per week, standard deviation of 7 units
How far darts hit relative to their target

Mean and St Deviation are enough

Combining the mean, standard deviation and the “knowledge” that the distribution is normal means we can judge

  the proportions above, below or between any given values
  the chance of a case chosen at random of falling in any given range

Standard Normal Distribution

The Standard Normal Distribution is a special case of the normal distribution:
  Mean = 0
  Standard deviation = 1

We can map any given ND onto it by
  subtracting the mean
  dividing by the standard deviation

We can thus use the SND to estimate probabilities/proportions for any normal distribution, once we know the mean and standard deviation

Table of the Standard Normal Distribution

Right tail (probability of X > z)

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.00  .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
0.10  .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
0.20  .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
0.30  .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
0.40  .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
0.50  .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
0.60  .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
0.70  .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
0.80  .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
0.90  .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
1.00  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
1.10  .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
1.20  .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
1.30  .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
1.40  .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
1.50  .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
1.60  .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
1.70  .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
1.80  .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
1.90  .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
2.00  .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
2.10  .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
2.20  .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
2.30  .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
2.40  .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
2.50  .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
2.60  .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
2.70  .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
2.80  .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
2.90  .0019  .0018  .0018  .0017  .0016  .0016  .0015  .0015  .0014  .0014
3.00  .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
3.10  .0010  .0009  .0009  .0009  .0008  .0008  .0008  .0008  .0007  .0007
3.20  .0007  .0007  .0006  .0006  .0006  .0006  .0006  .0005  .0005  .0005
3.30  .0005  .0005  .0005  .0004  .0004  .0004  .0004  .0004  .0004  .0003
3.40  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0002
3.50  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002
3.60  .0002  .0002  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
3.70  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
3.80  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
3.90  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000
4.00  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000

Reading the table

The table is best suited to telling us the proportion of the distribution (or, equivalently, the chance of picking a case at random) above a given value above the mean

This is because it starts at zero (the mean of the SND) and goes up only
However, since the normal distribution is symmetrical we can use this table for values below the mean too

Proportion above or below a given level above the mean

For example, given mean 100 and standard deviation 20, what’s the chance of observing a value above 130?

$$\mu = 100,\ \sigma = 20,\ X = 130 \;\Rightarrow\; z = \frac{X - \mu}{\sigma} = \frac{130 - 100}{20} = 1.5$$

From the table, we see that z = 1.5 corresponds to p = 0.0668 or about 6.7%: 6.7% of the distribution is above 130
Clearly, 100% - 6.7% = 93.3% of the distribution is below 130
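
The table look-up can be checked in Python. This sketch uses `statistics.NormalDist` from the standard library in place of the printed table; the function name `proportion_above` is made up for the example.

```python
from statistics import NormalDist

def proportion_above(x, mean, sd):
    """Right-tail proportion of a normal distribution above x."""
    z = (x - mean) / sd               # standardise: subtract the mean, divide by the SD
    return 1 - NormalDist().cdf(z)    # area above z under the standard normal curve

p = proportion_above(130, mean=100, sd=20)
print(f"above 130: {p:.4f}; below 130: {1 - p:.4f}")
```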

Proportion below a given level below the mean

We use the symmetry to make calculations below the mean

For example, for the same distribution, what’s the chance of observing a value below 85?

$$\mu = 100,\ \sigma = 20,\ X = 85 \;\Rightarrow\; z = \frac{X - \mu}{\sigma} = \frac{85 - 100}{20} = -0.75$$

By symmetry, the chance of being below -0.75 is exactly the same as that of being above +0.75, so we read that:
z = 0.75 ⇒ p = 0.2266, which means 22.7% of the distribution is below 85 (and 77.3% = 100% - 22.7% is above 85)

Proportion between two values

Once we know how to calculate the proportion of the distribution above or below any value, we can calculate the proportion between any pair of values

[Figure: normal curve with two values, X1 and X2, marked]

Proportion between two values

To calculate the proportion between X1 and X2, calculate
  The proportion below X1: P(X < X1)
  The proportion above X2: P(X > X2)
  P(X1 < X < X2) = 1 − P(X < X1) − P(X > X2)
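
A small sketch of the same three steps, again using `statistics.NormalDist` rather than the printed table. The values 85 and 130 are borrowed from the earlier examples (mean 100, standard deviation 20); the function name is made up for the illustration.

```python
from statistics import NormalDist

def proportion_between(x1, x2, mean, sd):
    """P(x1 < X < x2) for a normal distribution, following the steps in the text."""
    dist = NormalDist(mu=mean, sigma=sd)
    below_x1 = dist.cdf(x1)         # P(X < X1)
    above_x2 = 1 - dist.cdf(x2)     # P(X > X2)
    return 1 - below_x1 - above_x2  # what is left lies between X1 and X2

print(round(proportion_between(85, 130, mean=100, sd=20), 4))
```

With the table values from the earlier slides this is 1 − 0.2266 − 0.0668 = 0.7066, i.e. about 70.7% of the distribution lies between 85 and 130.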

Working backwards: given p find z

We may also wish to work in the opposite direction
Instead of asking what proportion of the distribution is above X, we may ask what is the level such that proportion p of the distribution is above it?
For example, given the same distribution, what is the level such that only 5% of the distribution is above it?

Working backwards: given p find z

Given the same distribution, what is the level such that only 5% of the distribution is above it?
We work backwards, starting in the body of the table by searching for the value nearest to 5% or 0.050
This corresponds with z = 1.645 (falls between 1.64 and 1.65)
Reverse the formula: X = σ × z + µ = 20 × 1.645 + 100 = 132.9
Therefore, 5% of this distribution is above 132.9
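
The backwards look-up corresponds to the inverse of the cumulative distribution function. A minimal sketch, using `NormalDist.inv_cdf` from Python's standard library (the function name `level_with_proportion_above` is made up for the example):

```python
from statistics import NormalDist

def level_with_proportion_above(p, mean, sd):
    """Find X such that a proportion p of a normal distribution lies above it."""
    z = NormalDist().inv_cdf(1 - p)   # z with right-tail area p
    return sd * z + mean              # reverse the standardisation: X = sigma * z + mu

print(round(level_with_proportion_above(0.05, mean=100, sd=20), 1))
```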


App: P<X

Find the proportion below X, (above or below the mean)

Link: Proportion below X



App: P>X

Find the proportion above X, (above or below the mean)

Link: Proportion above X



App: P between X1 and X2

X1 and X2 may both be on either side of, or straddle the mean.

Link: Proportion between X1 and X2

Unit 9: Sampling Distributions and the Central Limit Theorem

Outline

In this lecture we will examine
  Population parameters and sample estimates
  The Central Limit Theorem: this is why the normal distribution is important
  How to deal with the imprecision of sample estimates: confidence intervals
Reading: Agresti and Finlay, Chapter 4

Sampling error and sampling distributions

How do we reason about sampling and sampling error?
Sampling Error = sample value – true value; but true value unknown!
We can reason about the likely distribution of sampling error
If the true value is µ, how far is a sample value likely to fall from it?
→ a distribution of possible sample values

Referendum sample

Example: referendum with yes/no answer
The probability a randomly selected voter goes yes vs no is unknown
That is equivalent to saying the yes/no proportion is unknown
What if we want to know overall proportion, and poll 1500 random voters?
If we get a result of 54/46: how do we know we have any accuracy?

Computer simulation

Agresti and Finlay do a computer simulation to test
Computer selects 1500 random numbers between 0 and 1: 0–0.5 ⇒ yes, 0.5–1 ⇒ no
First run gets 51.6% yes, next 49.1%
They run it a million times, and make a histogram of the results: vast majority between 0.46 and 0.54, i.e., ± 4%
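
A scaled-down version of this simulation is easy to run. The sketch below is not Agresti and Finlay's own code: it repeats the poll 2,000 times rather than a million to keep it quick, and the function name, seed and parameter names are made up for the example.

```python
import random

def simulate_polls(n_voters=1500, n_polls=2000, p_yes=0.5, seed=42):
    """Simulate repeated polls; return the proportion voting yes in each poll."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_polls):
        # A random number below p_yes counts as a yes vote, otherwise no.
        yes = sum(1 for _ in range(n_voters) if rng.random() < p_yes)
        results.append(yes / n_voters)
    return results

results = simulate_polls()
inside = sum(1 for r in results if 0.46 <= r <= 0.54) / len(results)
print(f"lowest {min(results):.3f}, highest {max(results):.3f}")
print(f"share of polls between 46% and 54% yes: {inside:.3f}")
```

A histogram of `results` would show the heaped, roughly bell-shaped sampling distribution described in the text.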

Simulation by hand: N = 4

We can alternatively work through a “toy” example
  Yes and No equally likely
  Sample size of 4
  16 different possible combinations, each equally likely

Result is a “binomial” distribution
Note number of possible combinations is 2^N so this calculation is practical only with tiny samples
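
The enumeration can also be done by computer. This sketch lists all 2^N equally likely Yes/No sequences with `itertools.product` and counts how many contain each number of Yes votes; the function name is made up for the example.

```python
from collections import Counter
from itertools import product

def yes_count_distribution(n):
    """Count how many of the 2**n equally likely Y/N sequences contain each number of Yes votes."""
    counts = Counter(seq.count("Y") for seq in product("NY", repeat=n))
    return dict(sorted(counts.items()))

dist = yes_count_distribution(4)
total = sum(dist.values())
for yes, count in dist.items():
    print(f"{yes} yes out of 4: {count:2d} of {total} samples ({100 * count / total:.2f}%)")
```

For N = 4 the counts are 1, 4, 6, 4, 1. Trying larger N shows how quickly 2^N grows, which is why the by-hand approach is practical only for tiny samples.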

Simulating with N=4

Building up the possible samples one voter at a time.

One voter:
N
Y

Two voters:
N N
N Y
Y N
Y Y

Three voters:
N N N
N N Y
N Y N
N Y Y
Y N N
Y N Y
Y Y N
Y Y Y

Four voters, with the Yes : No count for each sample:
N N N N   0 : 4
N N N Y   1 : 3
N N Y N   1 : 3
N N Y Y   2 : 2
N Y N N   1 : 3
N Y N Y   2 : 2
N Y Y N   2 : 2
N Y Y Y   3 : 1
Y N N N   1 : 3
Y N N Y   2 : 2
Y N Y N   2 : 2
Y N Y Y   3 : 1
Y Y N N   2 : 2
Y Y N Y   3 : 1
Y Y Y N   3 : 1
Y Y Y Y   4 : 0

The same 16 samples, sorted by the number of Yes votes:
N N N N   0 : 4
N N N Y   1 : 3
N N Y N   1 : 3
N Y N N   1 : 3
Y N N N   1 : 3
N N Y Y   2 : 2
N Y N Y   2 : 2
N Y Y N   2 : 2
Y N N Y   2 : 2
Y N Y N   2 : 2
Y Y N N   2 : 2
N Y Y Y   3 : 1
Y N Y Y   3 : 1
Y Y N Y   3 : 1
Y Y Y N   3 : 1
Y Y Y Y   4 : 0


N = 4

[Figure: bar chart of the sampling distribution for N=4: percentage of samples (0 to 40) against predicted Yes vote (0, 25, 50, 75, 100)]

Sampling distribution for N=4


Online Apps:

Heads and Tails
Simulating binomial sampling

Central Limit Theorem

The Central Limit Theorem: for a sufficiently large sample size, the sampling distribution of a statistic such as the sample mean will be approximately normal
This is true, whatever the distribution of the population of interest
We have seen this with the simulation of sampling votes where the true population proportions are 50:50

Binomial distribution

[Figure, two panels. “Population Distribution”: bar chart of percentage vote, Yes and No. “Sampling Distribution”: histogram of the sample percentage (PCT, 45.00 to 55.00); N = 10000, Mean = 50.00, Std. Dev = 1.29]

Sampling distribution

This means that for sufficiently large samples, our sample statistic can be regarded as being drawn from a normal distribution with mean µ and standard deviation

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

This is a very important theorem because it allows us to reason about the precision of sample estimates
Note: $\sigma_{\bar{X}}$, the standard deviation of the sample mean, is known as the standard error
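
The claim can be checked by simulation. The sketch below is an illustration under stated assumptions: it draws repeated samples from an exponential population (chosen here only because it is strongly skewed; its mean and standard deviation are both 1) and compares the spread of the sample means with σ/√n. The seed, sample size and number of repetitions are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(7)

def sample_mean(n):
    """Mean of one random sample of size n from an exponential population (mean 1, SD 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

n = 100
means = [sample_mean(n) for _ in range(2000)]   # 2,000 repeated samples

print("mean of the sample means:", round(mean(means), 3))
print("SD of the sample means:  ", round(stdev(means), 3))
print("sigma / sqrt(n):         ", round(1 / sqrt(n), 3))
```

The standard deviation of the simulated sample means should come out close to 0.1, the standard error predicted by the formula, even though the population itself is far from normal.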


Unit 10: Sampling and confidence intervals

Point estimates

Agresti/Finlay: A point estimator of a parameter is a sample statistic that predicts the value of that parameter
For instance, our sample mean $\bar{X}$ is a point estimator of the population mean µ
Good point estimators require two things:

  To be centred around the true value (unbiased)
  To have as small a sampling error as possible (efficient)

Unbiased, efficient

Unbiasedness means that the estimate will fall around the true value, “on average”
Efficient means that it will fall close to the true value
Estimators such as the sample mean, standard deviation, or sample proportion are reasonably efficient and unbiased

Sample mean:

$$\bar{X} = \frac{\sum X_i}{n}$$

Sample standard deviation:

$$\hat{\sigma} = s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$

Sample proportion:

$$\hat{\pi}_1 = \frac{n_1}{n_1 + n_2}$$

Sampling distributions and sampling error

We have seen that sample estimates have sampling error, and now understand something of its characteristics, by exploring sampling distributions
Sample estimates can be considered as being drawn from an imaginary random distribution, which (if the estimator is unbiased) will centre on the true population parameter, with a level of imprecision measured by the standard error (which will be as low as possible if the estimator is efficient)
Where the sample is sufficiently large, the Central Limit Theorem tells us that the sampling distribution is normal, with mean µ and standard deviation $\sigma_{\bar{X}}$
We can use this information to add to our point estimate a measure of its precision: the Confidence Interval

Confidence intervals

Agresti/Finlay: “A confidence interval for a parameter is a range of numbers within which the parameter is believed to fall”
“The probability that the confidence interval contains the parameter is called the confidence coefficient” which is a number close to 1, like 95% or 99%
A confidence interval is a band around our point estimate within which we can claim that there is a, for instance, 95% chance that the true population value lies – how do we calculate this?
We work from the sampling distribution – if we are estimating a mean value, we know that 95% of estimates will fall within ± 1.96 standard errors of the mean

Example: transport spending

Assume we have a national sample, and are estimating spending on transport – if the true value is €35 per month, with a standard deviation of €5, and the sample size is 1600, then 95% of all possible sample estimates will fall between

$$35 - 1.96 \times \frac{5}{\sqrt{1600}} \quad\text{and}\quad 35 + 1.96 \times \frac{5}{\sqrt{1600}}$$

which is

$$\mu \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$$

The 1.96 comes from the normal distribution: 95% of the distribution is between 1.96 standard deviations above and below the mean

Reverse the reasoning

We don’t know µ, but we can reverse the reasoning and say that the true value has a 95% chance of falling in the range $\bar{X} \pm 1.96 \times \sigma_{\bar{X}}$
We don’t know $\sigma_{\bar{X}}$, the standard error, either:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

However, we do know the standard deviation of the sample, s, and this is an (approximately) unbiased estimator of the population standard deviation, σ, so we can estimate the standard error:

$$\hat{\sigma}_{\bar{X}} = \frac{s}{\sqrt{n}}$$

Calculating the interval

Thus in our example, if our sample mean is €34.75 and our sample standard deviation is €7.7 we can calculate:

  The standard error is $\frac{7.7}{\sqrt{1600}} = \frac{7.7}{40} = 0.1925$
  The lower bound is 34.75 − 1.96 × 0.1925 = 34.37
  The upper bound is 34.75 + 1.96 × 0.1925 = 35.13

Thus our point estimate is 34.75 and our 95% confidence interval runs from 34.37 to 35.13
We can interpret this as saying there is a 95% chance that the true value is in this range
This is because with 95% of all possible samples, such a confidence interval will include the true value
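
The same arithmetic as a short Python sketch, using the figures from the example (the function name `confidence_interval` is made up for the illustration):

```python
from math import sqrt

def confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Large-sample confidence interval for a mean: estimate ± z × standard error."""
    se = sample_sd / sqrt(n)                  # estimated standard error, s / sqrt(n)
    return sample_mean - z * se, sample_mean + z * se

low, high = confidence_interval(34.75, 7.7, 1600)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Passing a different multiplier, for instance `z=2.58`, gives an interval at another level of confidence.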

Levels of confidence

The 95% confidence level is often used but sometimes the chance of being wrong one time in twenty is not acceptable

We can use the normal distribution to choose other levels of confidence: for instance, 99% of the normal distribution lies within ±2.58 standard deviations
Again in our example (sample mean is €34.75, standard deviation €7.7)

  The lower bound is 34.75 − 2.58 × 0.1925 = 34.25
  The upper bound is 34.75 + 2.58 × 0.1925 = 35.25

To be more sure, we are necessarily less precise

Standard deviation/error for proportions

The example above is for sample means: the CI for sample proportions is similar
The sampling distribution for a proportion (percent unemployed, percent voting Republican, percent satisfied etc.) is also normal for large samples
The standard deviation of a 0/1 or yes/no variable depends on the proportions in the population however:

$$\sigma_{\pi} = \sqrt{\pi \times (1 - \pi)}$$

We can estimate the standard error then as

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}$$

CI for proportions

The 95% confidence interval for a proportion uses this, the same way as for the mean
If we have a sample of 1000 and find 45% of voters expressing an intention to vote yes we calculate as follows:

$$\hat{\sigma}_{\hat{\pi}} = \sqrt{\frac{0.45(1 - 0.45)}{1000}} = 0.0157$$

and thus the CI is

$$0.45 \pm 1.96 \times 0.0157$$

That is $\hat{\pi} \pm 0.031$, i.e., plus or minus 3 percentage points
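
The calculation for a proportion can be sketched in the same way, using the figures from the example (the function name `proportion_ci` is made up for the illustration):

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Standard error and large-sample confidence interval for a sample proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of the proportion
    return se, p_hat - z * se, p_hat + z * se

se, low, high = proportion_ci(0.45, 1000)
print(f"standard error {se:.4f}, 95% CI {low:.3f} to {high:.3f}")
```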

Unit 11: Two new distributions: t and χ²

The χ² distribution



The χ² test for independence in tables

The χ² test for association in tables
Independence: no association between two variables

  pattern of row percentages the same in all rows
  pattern of column percentages the same in all columns

Even if independence holds in the population, sampling variability leads to differences in percentages
How big can the differences be before we can be convinced that there is really association in the population?

Compare “observed” with “expected”

Method: compare the real table (“observed”) with hypothetical table under independence (“expected”)
Summarise the difference into a single figure (χ² statistic, chi-sq)
Compare χ² statistic with known distribution. . .
What is the probability of getting a sample statistic “at least this big” by simple sampling variability if independence holds in the population?

“Expected” ⇒ “independence”

The “expected” table has the same row and column totals, but the cell values are such that the percentages are the same as in the total row and column:

$$n_{ij} = \frac{R_i C_j}{T}$$

where $R_i$ is the total of row i, $C_j$ the total of column j, and T the overall total

For each cell we summarise the difference between observed (O) and expected (E) values as

$$\frac{(O - E)^2}{E}$$

The summary for the table as a whole is the sum of this quantity across all cells:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
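
The calculation can be sketched for a small table. The 2×2 observed counts below are hypothetical, invented purely for illustration; the degrees of freedom follow the usual (rows − 1) × (columns − 1) rule for tables, and the function name `chi_square` is made up for the example.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count under independence
            stat += (o - e) ** 2 / e                    # this cell's contribution
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

observed = [[10, 20],
            [20, 10]]          # hypothetical counts
stat, df = chi_square(observed)
print(f"chi-square = {stat:.3f} with {df} degree(s) of freedom")
```

Here every expected count is 15, so the statistic is 4 × 25/15, about 6.67. That is larger than 3.841, the 0.05 value for 1 degree of freedom in the table of the χ² distribution in these slides, so a difference this big would be unlikely to arise by sampling variability alone.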

X² has a χ² distribution

This statistic has a known distribution, the χ² distribution
That is, if we take a large number of samples from a population where there is no association, and calculate the statistic, they will have a distribution in a known form, and we can calculate the probability of finding a value “at least as large as” any given number
Depends only on the “degrees of freedom”: number of rows minus one times the number of columns minus one:

df = (r − 1)(c − 1)

Reading the χ² distribution table

To read the table, we go to the row corresponding to the degrees of freedom, and read across until we get to the column with our chosen probability level (say 0.05)
If our χ² is bigger than the table value, then there is at most one chance in 20 (i.e., 0.05) that it has arisen by sampling variability, if the population has no association

χ² distribution

[Figure: density curves of the χ² distribution for df 1, df 2, df 3, df 4 and df 5, plotted over values from 0 to 10]

Figure: The χ² Distribution for various degrees of freedom



Table of the χ² distribution (chi-sq)

Values of the χ² statistic for various degrees of freedom and areas under the right tail.
(see full table)

Degrees of            Area under right tail
Freedom     0.100   0.050   0.025   0.010   0.005
 1          2.706   3.841   5.024   6.635   7.879
 2          4.605   5.991   7.378   9.210  10.597
 3          6.251   7.815   9.348  11.345  12.838
 4          7.779   9.488  11.143  13.277  14.860
 5          9.236  11.070  12.833  15.086  16.750
 6         10.645  12.592  14.449  16.812  18.548
 7         12.017  14.067  16.013  18.475  20.278
 8         13.362  15.507  17.535  20.090  21.955
 9         14.684  16.919  19.023  21.666  23.589
10         15.987  18.307  20.483  23.209  25.188

155


The t-distribution

We have used the Standard Normal Distribution to make confidence intervals around sample means and sample proportions
This depends on the Central Limit Theorem:

    For a sufficiently large sample drawn from any population, the sampling distribution of the sample mean is approximately normal, even when the population distribution is not normal

In practice, for small samples the approximation to the normal distribution is not good, so we use a related distribution called the t-distribution


Student’s t

The t-distribution is like the normal distribution but wider and flatter
The bigger the sample, the closer it is to the normal distribution
It makes the CI a little wider for smaller sample sizes


Student’s identity

It is known as “Student’s t”, after the Guinness statistician who invented it, William Sealy Gosset (1876–1937)
Guinness were wary of losing trade secrets, so publication under his own name was banned and he published as “Student”


Degrees of freedom

In practice, we first determine the “degrees of freedom”: this is the sample size minus 1
Then we choose our probability level: a 95% CI has a probability level of 5% or 0.05
Then we read off the t-value: this is used in constructing CIs exactly as the z-value from the Standard Normal Distribution


Confidence interval with t

If the sample is large, the Central Limit Theorem says the sampling distribution is normal: use $\bar{X} \pm z \times SE$ for the CI
If the sample is small, the sampling distribution is wider than normal: use $\bar{X} \pm t \times SE$


Table of the t-distribution (part)

Table of the Student’s t distribution, two-tailed probability (see full table)

Degrees of     Probability level (area under both tails)
freedom       0.10     0.05    0.025     0.01    0.005
 1           6.314   12.706   25.452   63.657  127.321
 2           2.920    4.303    6.205    9.925   14.089
 3           2.353    3.182    4.177    5.841    7.453
 4           2.132    2.776    3.495    4.604    5.598
 5           2.015    2.571    3.163    4.032    4.773
 6           1.943    2.447    2.969    3.707    4.317
 7           1.895    2.365    2.841    3.499    4.029
 8           1.860    2.306    2.752    3.355    3.833
 9           1.833    2.262    2.685    3.250    3.690
10           1.812    2.228    2.634    3.169    3.581
...

A worked example

Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)

$\bar{X} = 6.2$

$s = 4.158$

$SE = s / \sqrt{n} = 1.315$

95% confidence, 9 degrees of freedom: $t_{0.05,9} = 2.262$

Low: $6.2 - 2.262 \times 1.315 = 3.226$

High: $6.2 + 2.262 \times 1.315 = 9.174$
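The worked example can be checked with Python's standard library. This is a minimal sketch, not the course's own method (the slides use Stata): the variable names are arbitrary, and the t value is taken from the table in these slides rather than computed.

```python
import math
import statistics

x = [9, 2, 5, 5, 13, 13, 5, 5, 1, 4]

n = len(x)
mean = statistics.mean(x)
s = statistics.stdev(x)      # sample standard deviation (divides by n - 1)
se = s / math.sqrt(n)        # standard error of the mean

# t value for 9 degrees of freedom at the 0.05 level, read from the table
# in these slides rather than computed.
t_crit = 2.262

low = mean - t_crit * se
high = mean + t_crit * se

print(f"mean={mean:.1f} s={s:.3f} SE={se:.3f}")
print(f"95% CI: {low:.3f} to {high:.3f}")
```

The printed figures should agree with the slide to three decimal places.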


Unit 12: Questionnaire Design


Surveys and Questionnaires

The main way of collecting survey data is by means of questionnaires and structured interviews
The questionnaire used in structured interviewing is sometimes referred to as the interview schedule
Questionnaires filled out by the respondent him/herself are referred to as self-completion questionnaires
There are several important issues relating to the design and implementation of questionnaire-based surveys
A good reference is DA de Vaus, Surveys in Social Research, chapter 6 (“Constructing Questionnaires”)


Purpose of questionnaire

The practical purpose of a questionnaire survey is to get standard information from a large number of people relatively cheaply and reliably
This requires a minimum of ambiguity in the meaning of the recorded answers: they need to be clear and precise
It also requires a minimum of non-response
    Item non-response: refusal to answer or “don’t know” to specific questions
    Respondent refusal: flat refusal to participate
Hence good design is important:
    Clarity of questions and structure
    No unnecessary burden on the respondent


Measurement Theory

A questionnaire is a data collection instrument
This implies that what is measured must be standardised: the same questions asked of every respondent, in the same way, so that the data recorded mean the same thing – standardisation is what allows the use of computers (and statistics) to analyse them
This is related to reliability: a reliable instrument will get the same results in the same circumstances


Validity

Validity is also important: an instrument (e.g., a set of questions) is valid insofar as it measures what it is intended to measure
An instrument can be valid without being reliable, and vice versa
For instance, a set of questions designed to assess right-wing attitudes may be reliable (provide consistent results) but not valid (if scoring high is not very strongly associated with being right-wing, etc.)


Reliability, comparison, validity

Reliability is very important for comparison: without it we cannot compare cases or attribute differences between cases to their other characteristics – reliability is essential for analysis
Without validity, there is no point to analysis – you’re not studying what you think you’re studying


Questionnaire Design

There are two aspects to questionnaire design
    The overall structure of the document
    The individual questions


Composing questions

The questions themselves must be clear and unambiguous
Express the questions in clear and simple language
Don’t use leading questions (avoid “Isn’t it the case that X is a good idea?”; prefer “Do you think X is a good idea or a bad idea?”)
Ask a single thing at a time (avoid “Do you have a job, and if so how much do you earn?”)
Avoid hypothetical questions: they often don’t give useful information (e.g., “If you had a grant how much more money would you spend on drink?”)
If you need to ask for information about others, restrict it to factual information (“What is your husband’s job?”, but not “What does your husband think about X?”)
Make the questions easy to answer:
    provide a set of options (perhaps on a prompt card)
    with amounts (time, money etc.), providing options helps precision (e.g., times per week, ranges)


Document structure

The document must be structured simply and logically
Group questions in a way that will seem reasonable to the respondent
Use clear routing:

    5: Do you have a job? (if yes, ask qn 6, else skip to qn 7)
    6: What is your job?
    7: . . .

Reserve sensitive questions (e.g., income, drug use) to the end of the questionnaire: they are less likely to scare off respondents there


Question structure

The structure of questions is also important
    Each question must have its answer space: a tick-box, a space for writing, a set of categories
    Closed questions are preferred: a fixed set of options makes it easier to answer and to analyse
    Always allow the category “Other” with closed questions, with space to write (“If other, please specify:”)


Types of closed question

Closed responses can take several forms:
    One only: “Tick the category that best describes your job”
    One or more: “Tick all of the following reasons that are relevant”
    Ranking: “Rank the importance for you of each of the following reasons (1, 2, 3 etc.)”


Likert scales

Measurement of attitudes often uses “Likert” scales:
    a set of statements relevant to the attitude being measured
    with options indicating agreement:
        Strongly disagree
        Disagree
        Neither agree nor disagree
        Agree
        Strongly agree


Practical organisation

The practical organisation of the questionnaire must take question structure into account: easy to answer and easy to process
Processing involves several steps

    1. Recording the information as it is being collected
    2. Checking it is consistent and the right questions are answered
    3. Coding it onto a computer
    4. Checking it is consistent once on computer
    5. Adding variable and value labels to help the computer data set to make sense


Design principles

The design of the questionnaire should keep the first three of these in mind
    It should be easy to record the information
    It should be easy to read the recorded information to see that it makes sense, and to see that the correct routing has been followed
    The layout of the questionnaire should anticipate the structure of the computer data set


Example questionnaire

Example Questionnaire Extract                    Office use only

1: Sex:    Male         1                        q1
           Female       2

2: Age:    18 or under  1                        q2
           19 – 23      2
           24 – 30      3
           31 – 40      4
           41 – 50      5
           51 – 64      6
           65 or more   7


Many options, tick one

3: Which of the following options best describes your current situation? (show card):

    Self-employed               1                q3
    Employed                    2
    Unemployed                  3
    Retired                     4
    Family care                 6
    Full time student, school   7
    Long-term sick, disabled    8
    Training scheme             9
    Other                       10

    If “Other” please specify:


Ordinal options (with a flaw)

4: How often do you read newspapers? Tick the category that is closest:

    Never                  1   (Skip to question 6)    q4
    More than 1 per day    2
    1 every day            3
    2–4 per week           4
    1 per week             5
    Less than 1 per week   6


Picking zero or more from a list

5: Which of the following newspapers do you read? (show card) Tick each one that applies:

    Irish Times           a    q5a
    Irish Independent     b    q5b
    Irish Examiner        c    q5c
    Limerick Leader       d    q5d
    Sunday Independent    e    q5e
    etc. etc.             f    q5f
    Other                 g    q5g

    If “Other” please specify:


Attitude questions with Likert answers

6: People have different views about women’s role in society. Please indicate whether you agree or disagree with each of the following things people might say, using the categories provided:

Categories: 1 = Strongly agree, 2 = Agree, 3 = Neither agree nor disagree, 4 = Disagree, 5 = Strongly disagree

A woman’s place is in the home                                                        1  2  3  4  5    q6a
Young children suffer if the mother works outside the home                            1  2  3  4  5    q6b
Women are entitled to the same pay for the same work as men                           1  2  3  4  5    q6c
When a wife works it is important for the husband to do his share of the housework    1  2  3  4  5    q6d


Pilot your questionnaire

It is very important that a questionnaire work well once in the field
Test it beforehand,
    on friends/colleagues
    on a “pilot” subsample drawn from the reference population
Are the questions comprehensible?
Are the categories in closed questions adequate? Do they cover everything? Make the right distinctions? No very unlikely ones? Will they make sense to the respondent?


Piloting

Is the structure okay? Not too complicated or illogical?
Is there anything likely to confuse the respondent or interviewer?
Exactly how long will it take? If too long, shorten it now!
Piloting will also allow you to create closed categories for questions, using real-world feedback rather than your imagination

Distribution and collection

How to choose a sample? Depends on research question and resources
Types of sample
    Random sample from explicit sample frame
    Multi-stage random sample
    Quota sample
    Accidental/convenience sample


Samples

The first two are the most powerful (true random samples), and it is enormously important to get as many of the sample as possible to respond: each individual counts
When it is not possible to generate a sampling frame, quota or convenience samples may be worthwhile
With quota sampling, representativity is “enforced” by quota, and the emphasis is on filling the quotas as efficiently as possible


Convenience and snowball

With convenience sampling (e.g., snowball sampling), representativity isn’t possible in the same way, so there is more emphasis on getting enough people, though it is still very important to minimise refusals (very likely, people who refuse are different in some interesting way)
With quota and convenience sampling, it is also very important to make sure respondents fit the requirements: quite formal for quota, related to the research problem definition for convenience sampling


Accessing respondents

With a sample from a frame there is a clear identification of exactly whom to interview, often with addresses – the work is then in arranging times suitable to the respondent (and in a large survey, dividing respondents among interviewers)
With quota or convenience sampling, some other strategy has to be found to contact potential respondents, e.g., random telephone dialling or stopping people in the street for quota sampling; snowball sampling (asking respondents to tell you about other people you could interview), working through groups or organisations or locations etc. for convenience sampling

Modes of delivery

How a questionnaire is delivered has consequences for how it is answered and who answers it
We can distinguish between face-to-face interviews and self-completion questionnaires


Face to face

Face-to-face interviews (perhaps not literally face to face) involve an interviewer mediating the questions and the responses – e.g., s/he reads the questions, provides prompts and clarifications, and writes down the answers
    Can be carried out by the researcher or a delegated interviewer
    Can be carried out on the phone


Interviewer delivered

Questionnaires applied by an interviewer have certain advantages:
    Interviewer can clarify questions
    Can also clarify ambiguous answers
    More likely to get a complete response
    More likely to get a response in the first place (harder to refuse a person, easy to ignore a piece of paper)
And disadvantages:
    Very costly
    Certain sorts of information that might be written down may be suppressed in a face-to-face interview
    Answers possibly more likely to be slanted towards an expectation of what the interviewer would like to hear


Self completion

Self-completion questionnaires are completed by the respondent, perhaps under supervision or perhaps in his/her own time
They can be distributed to a sample (e.g., pupils in schools, workers in organisations, by post to addresses etc.)
They can also be distributed in an “opportunistic” way: handed out at an event (e.g., after mass, at a concert) or left in a public place (perhaps with some publicity or an endorsement by someone “in authority” – e.g., the priest brings the congregation’s attention to the fact that they are there)


Response rate

Response rate is a big problem with self completion
    With distribution to specific people (i.e., sample, though non-random) make sure there is a covering letter to the individual indicating that the research is valuable and the data will be confidential
    Follow up with reminders some time later to those who haven’t replied
    With “opportunistic” distribution, make sure the questionnaire makes the research seem valuable, and guarantee confidentiality
    With opportunistic distribution, you can’t issue personal reminders (ask the priest to remind the congregation a couple of weeks later, etc.)


Advantages and disadvantages

The advantages of self-completion are
    more privacy for the respondent
    cheap and easy
The disadvantages are many
    More possibilities for misunderstanding
    No control over meaningless, ambiguous or illegible answers
    Much harder to control response rate: will be low
    More room for response bias to enter: only those with an axe to grind will bother replying (in the extreme case) – for instance, people who have experienced sexual harassment in the workplace are far more likely to respond to a relevant survey than those who have not, so don’t use such a technique to estimate the prevalence of sexual harassment

Unit 13: Hypothesis testing


Topics

Three important topics today:
    Hypothesis testing
    Significance
    t-test for paired samples

Hypothesis Testing

Confidence intervals assess imprecision

Confidence intervals allow us to present sample information appropriately
    Point estimate, e.g., mean or other sample statistic: our “best guess” of the true value
    Confidence interval: range in which we are “confident” the true value (the population parameter) lies
In combination, the point estimate and CI give us the answer with a measure of its precision


Hypothesis testing and CIs

Statistical inference proceeds by hypothesis testing: a formalisation of the Confidence Interval approach
It is easier to disprove something than to prove something true
Thus if we wish to test if a variable has an effect on another (e.g., does a switch to a 4-day week change productivity) we set up a hypothesis, for instance

    $H_1$: Productivity after is not the same as productivity before, $\bar{X}_a \neq \bar{X}_b$


Recent example

Microsoft recently experimented with a 4-day week in Tokyo:

https://www.theguardian.com/technology/2019/nov/04/microsoft-japan-four-day-work-week-productivity

They found productivity increased, as well as worker satisfaction



Null hypothesis

To test this, turn it around and set up a null hypothesis that says the opposite:

    $H_0$: Productivity after is equal to productivity before, $\bar{X}_a = \bar{X}_b$

If we can reject the null hypothesis on the basis of our sample data, we can say the data support the main hypothesis


Claims about population

Hypothesis testing is a way of using the reasoning behind CIs to make specific claims about the population
Say we want to know if there is a relationship between two variables, e.g., whether switching to a 4-day week has an effect on productivity (positive or negative)
We look at a sample of workers to make inferences about the population
We begin with a hypothesis: $\bar{X}_a \neq \bar{X}_b$ or $\bar{X}_a > \bar{X}_b$
We negate the hypothesis to form a null hypothesis, called $H_0$: $H_0: \bar{X}_a = \bar{X}_b$, which is equivalent to $H_0: D_x = \bar{X}_a - \bar{X}_b = 0$
That is, on average, the productivity difference is zero


Testing the null hypothesis

We then test the null hypothesis:
    First we calculate a sample mean productivity difference, $D_x$
    Then we construct a confidence interval at our chosen level of confidence (e.g., 95%): $D_x \pm z_{0.95} \times SE$
    The CI gives us a band around the point estimate within which we are 95% sure the true value lies
    If zero lies outside the interval, we are at least 95% sure the true population value is not zero, and we can reject the null hypothesis
    If zero lies within the interval, then zero is in the range of plausible values, so we cannot reject the null hypothesis
    We don’t say we “accept the null hypothesis” because zero is just in the range of plausible values, and other values in this range are approximately as likely


Reject or fail to reject

Rejecting the null hypothesis constitutes support for the initial or “alternative” hypothesis
Failing to reject the null hypothesis means the data fail to support the initial hypothesis: “there is no evidence that the switch to a 4-day week affects productivity”
Failure to support the initial hypothesis may be because
    It is actually false, i.e., $D_x = 0$
    The effect is small and/or very variable, and thus the sample is too small to detect it


Example: Productivity before and after an intervention

ID   Before   After
 1     3.5     4.6
 2     1.8     2.3
 3     2.4     2.5
 4     3.3     4.4
 5     1.7     2.1
 6     3.7     5.1
 7     4.4     5.6
 8     3.4     4.1
 9     1.8     1.4
10     2.3     1.7


How do we “test” the null hypothesis?

In this example we calculate the difference in productivity before and after, in the sample data
Some differences may be negative, some may be positive, but we are interested in the average: is it systematically different from zero?
Strategy: calculate the mean of the differences, and construct a CI around it (say at 95%)
If zero lies outside the CI, then we are at least 95% sure the true value lies in a range that does not include zero
If zero lies within the CI, then the range within which we think the true value lies does include zero
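The strategy can be sketched in Python using the before/after table from these slides. This is an illustration under stated assumptions rather than the course's procedure: because the sample has only 10 cases, the sketch uses the t value for 9 degrees of freedom from the t table (2.262) instead of a z value, and all variable names are arbitrary.

```python
import math
import statistics

# Productivity data from the before/after table in these slides.
before = [3.5, 1.8, 2.4, 3.3, 1.7, 3.7, 4.4, 3.4, 1.8, 2.3]
after = [4.6, 2.3, 2.5, 4.4, 2.1, 5.1, 5.6, 4.1, 1.4, 1.7]

diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(len(diffs))

t_crit = 2.262  # table value for 9 degrees of freedom, 0.05 level
low, high = mean_diff - t_crit * se, mean_diff + t_crit * se

# Reject the null hypothesis only if zero lies outside the interval.
reject_null = not (low <= 0 <= high)

print(f"mean difference = {mean_diff:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
print("reject H0" if reject_null else "cannot reject H0")
```

The interval lies just above zero, so on these data the null hypothesis of no change is rejected at the 95% level.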



Reject or not?

In the former case (zero outside interval), we can reject the null hypothesis: we are at least 95% sure that zero is not the true value
We can therefore say the data support the initial hypothesis that the switch to a 4-day week affects productivity
In the latter case (interval contains zero), we cannot reject the null hypothesis: zero is in the range where we feel the true value lies
In this case there is no evidence in the sample data of an effect of the 4-day week on productivity
This is not the same as evidence of no effect!
That zero lies within the CI is not the same as zero being the true value!

Statistical significance

Significance: let’s say we do a hypothesis test with a 95% confidence level, and we find that zero is way outside the CI
We can try again with a 99% confidence level:
    If it is still outside the interval we are not “at least 95%” but “at least 99%” sure that zero is not the true value
If we keep trying with CIs with higher confidence levels we will eventually find one where zero is just outside the interval
If that is at confidence level C we can say that we are C% sure (not “at least” any more) that zero is not the true value


The chance of being wrong?

There is then a C% chance that the true value is in the range that doesn’t include zero, or a 100% - C% chance that the true value is outside the CI, and therefore could be zero
This p = 100% - C% value is the probability that we get a sample statistic at least as different from zero as we did, even though the true value was zero
This is known as the significance of the sample estimate, or its p-value
We want it to be as small as possible, typically under 5% (0.05)
p-values are widely used – stats programs report them in many places
In general the interpretation is “what’s the probability of getting a result at least this extreme by chance if the null hypothesis was true?”


App: confidence intervals and significance

http://teaching.sociology.ul.ie:3838/apps/cislide



t-test

Rather than use the CI we can set this up as a “t-test”
We can find the t corresponding to the CI just touching zero thus:

$t = \frac{\bar{X}}{SE}$

If that t is larger than the critical value, then the CI using the critical value is narrower and doesn’t overlap zero
The significance is the exact p-value of that t
This example is a “paired sample t-test”


t-test example in Stata

. gen diff = after-before

. ttest diff == 0

One-sample t test

Variable   Obs        Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
    diff    10    .5499999   .2166667   .6851601    .0598659   1.040134

    mean = mean(diff)                                   t = 2.5385
Ho: mean = 0                           degrees of freedom = 9

   Ha: mean < 0            Ha: mean != 0            Ha: mean > 0
Pr(T < t) = 0.9841    Pr(|T| > |t|) = 0.0318    Pr(T > t) = 0.0159
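The t statistic and two-tailed p-value in this output can be cross-checked without Stata. The sketch below is only an illustration: it takes the mean and standard error from the output above, and it approximates the p-value by numerically integrating the density of Student's t with Simpson's rule, which is a technique chosen here for self-containment and not how Stata computes it. The function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Pr(|T| > |t|), integrating the density from 0 to |t| with Simpson's rule."""
    upper = abs(t)
    h = upper / steps                     # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 1 - 2 * area


# Mean and standard error of the differences, as reported in the Stata output.
mean_diff, se, df = 0.5499999, 0.2166667, 9

t = mean_diff / se
p = two_tailed_p(t, df)
print(f"t = {t:.4f}, two-tailed p = {p:.4f}")
```

Halving the two-tailed value gives the one-tailed probability reported under `Ha: mean > 0`.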



t-test – paired

. ttest before==after

Paired t test

Variable   Obs        Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
  before    10        2.83    .299648     .94757    2.152149   3.507851
   after    10        3.38   .4859812   1.536808    2.280634   4.479366
    diff    10   -.5499999   .2166667   .6851601   -1.040134  -.0598659

    mean(diff) = mean(before - after)                   t = -2.5385
Ho: mean(diff) = 0                     degrees of freedom = 9

Ha: mean(diff) < 0       Ha: mean(diff) != 0       Ha: mean(diff) > 0
Pr(T < t) = 0.0159    Pr(|T| > |t|) = 0.0318    Pr(T > t) = 0.9841


χ² test and significance

Another example of significance occurs in the χ² (chi-sq) test for association in a table
Here the initial hypothesis is that the two variables are associated
Thus the null hypothesis is that they are not associated (the “independence hypothesis”)
When we calculate the χ² statistic, $\sum \frac{(O - E)^2}{E}$, we compare its value with the range of possible values we would get if $H_0$ were true
This is what we read from the table of the χ² distribution


χ² example in Stata

. tab race region, chi

   race of      region of the united states
respondent   north eas  south eas       west      Total
     white         582        307        375      1,264
     black          82         94         28        204
     other          15         14         20         49
     Total         679        415        423      1,517

Pearson chi2(4) = 53.1683   Pr = 0.000
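As a cross-check, the statistic Stata reports can be recomputed from the observed counts. This is a sketch under stated assumptions: the counts are copied from the table above, the variable names are arbitrary, and the p-value uses a closed-form expression for the right-tail area that holds only for 4 degrees of freedom, so it is not a general-purpose routine.

```python
import math

# Observed counts from the race-by-region table above (rows: white, black, other).
observed = [
    [582, 307, 375],
    [82, 94, 28],
    [15, 14, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected under independence
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 4 only, the right-tail area has the closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"chi2({df}) = {chi2:.4f}, p = {p_value:.3g}")
```

The p-value is far below 0.001, which is why Stata displays it as 0.000.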



Significance and error

Another way of looking at significance is “the chance we would be wrong if we believe the initial hypothesis”
For instance, if there is one chance in twenty (p = 0.05) that the true value is outside the CI, then by basing our decision on the CI we will be wrong one time in twenty
This is known as Type I Error: rejecting the null hypothesis when it is true
    e.g., the true value might be zero but a small number of possible samples generate CIs that don’t include zero
    e.g., there may be no association but a small number of possible samples yield high χ² statistics
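A small simulation can illustrate the "one time in twenty" idea. Everything in this sketch is an assumption made for illustration: the population is normal with mean 0 and standard deviation 1 (so the null hypothesis is true by construction), the standard deviation is treated as known so that a z interval is exact, and the sample size, number of samples and seed are arbitrary.

```python
import math
import random


def type_one_error_rate(n_samples=2000, n=30, z=1.96, seed=1):
    """Share of samples whose 95% CI excludes the true mean of zero.

    Samples are drawn from a normal population with mean 0 and SD 1, so the
    null hypothesis is true and every rejection is a Type I error.
    """
    rng = random.Random(seed)
    se = 1 / math.sqrt(n)          # population SD is known to be 1 here
    rejections = 0
    for _ in range(n_samples):
        mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
        if abs(mean) > z * se:     # zero lies outside mean +/- z * SE
            rejections += 1
    return rejections / n_samples


rate = type_one_error_rate()
print(f"Type I error rate: {rate:.3f}")  # close to 0.05; exact value depends on the seed
```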



Type I and Type II error

If it is very important to avoid Type I error, use high confidence levels (e.g., 99.5% instead of 95%) or insist on low p-values (e.g., 0.005 instead of 0.05)
However, there is a second type of error, Type II
    Failing to reject the null hypothesis when it is false
    That is, failing to support the initial hypothesis even though it is true


Type I and Type II

If we raise the confidence level we reduce the risk of Type I error but raise the risk of Type II error
That is, if we make a special effort not to accept an initial hypothesis unless there is very clear evidence, we necessarily fail to accept it where there is only fairly clear evidence
For a given p-value threshold, we can only reduce the Type II error by increasing the sample size

Unit 14: More $t$-tests



The basic t-test: sample versus reference-point

The simplest t-test compares a sample mean against a fixed number:

$H_0: \mu = X_r$

$t = \frac{\bar{X} - X_r}{SE}$

If t is bigger than the critical value for 95%, or its p-value is below 0.05, we reject the null hypothesis


Paired-sample t-test

Comparing “paired samples” is the same, with the difference being compared with zero:

$D = \bar{X}_{after} - \bar{X}_{before}$

$H_0: \delta = 0$

$t = \frac{D - 0}{SE}$


“t-test” for a proportion

Comparing a sample proportion against a fixed number such as 50% has a similar logic
It doesn’t use the t-distribution, but can use the standard normal for “large” samples

$H_0: \pi = \pi_r$

$z = \frac{p - \pi_r}{SE}$
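A minimal sketch of this test in Python follows. The poll figures and the function name are hypothetical, and the standard error is assumed to be computed from the sample proportion, $\sqrt{p(1-p)/n}$, following the formula given in the summary of inference later in these slides; some texts use the reference proportion in the standard error instead.

```python
import math


def proportion_z_test(successes, n, pi_ref):
    """z statistic and two-tailed p-value for H0: pi = pi_ref (large samples)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # SE from the sample proportion (assumption)
    z = (p - pi_ref) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-tailed area, standard normal
    return z, p_value


# Hypothetical poll: 220 of 400 respondents in favour, tested against 50%.
z, p_value = proportion_z_test(220, 400, 0.5)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these made-up figures z is just above 1.96, so the p-value falls just under 0.05.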



Directional hypotheses

A complication: some hypotheses are directional
    e.g., holidays make you happier, training raises your earning power

$H_1: W_{after} > W_{before}$

$H_0: W_{after} \le W_{before}$

$H_0: D \le 0$


Direction ⇒ one-way t-test

If zero is below the CI, reject the null hypothesis
If zero is within or above the CI, cannot reject
Net result: higher confidence for the same CI: $1.96 \times SE$ gives 97.5%, not 95%
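The 97.5% figure can be checked from the standard normal distribution. This short sketch uses only the error function from Python's standard library; the helper name is arbitrary.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


z = 1.96
two_sided = normal_cdf(z) - normal_cdf(-z)   # area between -1.96 and +1.96
one_sided = normal_cdf(z)                    # area below +1.96

print(f"two-sided: {two_sided:.3f}, one-sided: {one_sided:.3f}")
```

The same multiplier leaves 5% in two tails combined but only 2.5% in a single tail, which is where the gain in confidence for a directional hypothesis comes from.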


Comparing means across groups

We don’t always have situations where we want to test something as simple as whether the true answer is zero
However, we very often want to test whether a mean (e.g., income) is different according to values of another variable (e.g., sex)
If sex affects wage, we would expect to see the mean wage for men ($\bar{X}_m$) to be different from the mean wage for women ($\bar{X}_w$)
We can consider the sample difference ($\bar{X}_m - \bar{X}_w$) to be a point estimate of the population difference
The null hypothesis is that $\bar{X}_m = \bar{X}_w$ or $\bar{X}_m - \bar{X}_w = 0$


Independent-samples t-test

To construct a CI we need the SE, which is very like the normal one if both groups have the same population variance (or standard deviation):

$\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

If this cannot be assumed, the SE is more complex, and depends on the separate standard deviations:

$\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
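Both standard errors can be computed from group sizes and standard deviations alone. The sketch below is an illustration, not the course's method: the function names are hypothetical, the inputs are the group figures from the Stata two-sample output in these slides, and the common $s$ in the equal-variance case is assumed to be the usual pooled estimate, which the slide itself does not spell out.

```python
import math


def se_equal_variance(n1, s1, n2, s2):
    """SE of the difference using a pooled standard deviation s (assumption)."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)


def se_unequal_variance(n1, s1, n2, s2):
    """SE of the difference using each group's own standard deviation."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Group sizes and standard deviations from the Stata output in these slides.
n_m, s_m = 953, 48.90243
n_f, s_f = 978, 48.21223

print(f"equal variances:   SE = {se_equal_variance(n_m, s_m, n_f, s_f):.4f}")
print(f"unequal variances: SE = {se_unequal_variance(n_m, s_m, n_f, s_f):.4f}")
```

Because the two groups have almost the same standard deviation and size, the two standard errors differ only in the fourth decimal place.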



Two-sample t-test in Stata

. ttest grsearn, by(sex)

Two-sample t test with equal variances

   Group     Obs        Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
    male     953    25.89192   1.584105   48.90243    22.78318   29.00066
  female     978    27.92025   1.541657   48.21223     24.8949   30.94559
combined   1,931    26.91921   1.104885    48.5521    24.75232   29.08611
    diff           -2.028325   2.210045              -6.362652   2.306002

    diff = mean(male) - mean(female)                    t = -0.9178
Ho: diff = 0                           degrees of freedom = 1929

   Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
Pr(T < t) = 0.1794    Pr(|T| > |t|) = 0.3589    Pr(T > t) = 0.8206


Two-sample t-test, unequal variance

. ttest grsearn, by(sex) unequal

Two-sample t test with unequal variances

   Group     Obs        Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
    male     953    25.89192   1.584105   48.90243    22.78318   29.00066
  female     978    27.92025   1.541657   48.21223     24.8949   30.94559
combined   1,931    26.91921   1.104885    48.5521    24.75232   29.08611
    diff           -2.028325   2.210451              -6.363455   2.306805

    diff = mean(male) - mean(female)                    t = -0.9176
Ho: diff = 0          Satterthwaite's degrees of freedom = 1925.9

   Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
Pr(T < t) = 0.1795    Pr(|T| > |t|) = 0.3589    Pr(T > t) = 0.8205


Key concepts

Hypothesis test
    Null hypothesis
    Initial or alternative hypothesis
Type I and Type II error
Statistical significance and p-values
t-tests:
    Single value compared with reference
    Paired values compared with implicit zero reference
    Independent samples t-test: comparing two groups
        Equal vs unequal variance

Summarising inference

For large samples, we use the normal distribution to construct confidence intervals around means of “quantitative” (interval/ratio) variables
For small samples we use the t-distribution to construct the confidence interval
For interval/ratio variables we usually estimate the mean, and use

$SE = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}}{\sqrt{n}}$


Summarising inference: proportions

For nominal variables like vote, sex, etc. we calculate proportions, not means (where we split in two)
With large samples (at least 20 in each category) we can construct confidence intervals using the normal distribution and the formula $\sigma_\pi = \sqrt{\frac{p(1 - p)}{n}}$ for the standard error
With this we can conduct hypothesis tests in exactly the same way as with interval/ratio data
However, with small samples the approximation to the normal distribution no longer holds and we have to use another distribution, the binomial distribution
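A confidence interval for a proportion along these lines might be sketched as follows. The function name, the sample figures and the choice to raise an error are all hypothetical; the only elements taken from the slide are the standard error formula and the rule of thumb of at least 20 cases in each category.

```python
import math


def proportion_ci(successes, n, z=1.96, min_per_category=20):
    """Normal-approximation CI for a proportion, refusing samples that are too small."""
    failures = n - successes
    if min(successes, failures) < min_per_category:
        raise ValueError("too few cases in a category for the normal approximation")
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Hypothetical sample: 120 of 300 respondents say they voted.
low, high = proportion_ci(120, 300)
print(f"95% CI = ({low:.3f}, {high:.3f})")
```

Raising an error for small categories is one possible design; a fuller version would fall back on the binomial distribution, as the slide suggests.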



Summarising inference: χ² test

For nominal variables with more categories, and for tables made from nominal variables, we can use the χ² test
Again, this has “large sample” requirements: none of the expected values should be < 5

Unit 15: Correlation


Remaining topics

From today we look at two related techniques for interval and ratio variables:
    Correlation: a single measure of association for interval/ratio variables
    Linear Regression: a very powerful technique for describing the relationship between one interval/ratio variable and another

Measures of association for interval/ratio variables

Scatterplots

Scatterplots can be considered as the interval/ratio analogue of cross-tabs: arbitrarily many values mapped out in 2 dimensions

*Figure 1:* Age 47, income €6,535. (Scatterplot of Income, 0 to 10,000, against Age, 0 to 100; the point lies 47 units from the left and 6,535 units from the bottom.)


Age vs Income

*Figure 2:* Age and income, Wave 1 BHPS. (Scatterplot of Income, 0 to 50,000, against Age, 0 to 100.)


Summarising association simply

- We can see a lot of detail in a scatterplot, but sometimes we can summarise it in simple ways
- For instance the two variables may have a positive association: when one is high the other tends to be high, and vice versa
- Or a negative association: when one is high the other tends to be low


Strong positive

*Figure 3:* Fictional data displaying a strong *positive* linear relationship. [Scatterplot: X variable on the x-axis (0 to 9), Y variable on the y-axis (5 to 35).]


Strong negative

*Figure 4:* Fictional data displaying a strong *negative* linear relationship. [Scatterplot: X variable on the x-axis (0 to 10), Y variable on the y-axis (0 to 25).]


Weak negative

*Figure 5:* Fictional data displaying a *weak* negative linear relationship. [Scatterplot: X variable on the x-axis (0 to 10), Y variable on the y-axis (-10 to 30).]


No relationship

*Figure 6:* Fictional data displaying absence of a relationship. [Two scatterplots of Y variable against X variable: one with X from 0 to 10 and Y from 0 to 20, the other with X from 0 to 20 and Y from 0 to 10.]


App

http://teaching.sociology.ul.ie:3838/so5041/corr



The correlation coefficient

How does it work?

- Combines X deviations $(X_i - \bar{X})$ and Y deviations $(Y_i - \bar{Y})$, i.e., compares each point with the mean for X and the mean for Y
- With positive association, cases below average on X tend to be below average on Y (and cases above average on X tend to be above average on Y)
- With negative association, cases below average on X tend to be above average on Y and vice versa
- With positive association, $X_i - \bar{X}$ and $Y_i - \bar{Y}$ tend to have the same sign (both negative below the means, both positive above them), so $(X_i - \bar{X})(Y_i - \bar{Y})$ is positive
- With negative association, $X_i - \bar{X}$ and $Y_i - \bar{Y}$ tend to have opposite signs, so $(X_i - \bar{X})(Y_i - \bar{Y})$ is negative


Deviations

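*Figure 7:* X and Y deviations for correlation. [Diagram marking the means $\bar{X}$ and $\bar{Y}$ on the X and Y axes and the deviations $X - \bar{X}$ and $Y - \bar{Y}$ of a point from them.]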


Pearson Product-Moment Correlation Coefficient

Pearson Product-Moment Correlation Coefficient ($r$):

$$r = \frac{S_{XY}}{\sqrt{S_{XX} \cdot S_{YY}}}$$

$$S_{XX} = \sum (X - \bar{X})^2$$

$$S_{YY} = \sum (Y - \bar{Y})^2$$

$$S_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$$

- Range: $-1 \le r \le +1$
- $r$ is a symmetric measure: $r_{xy} = r_{yx}$
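The formula for $r$ translates almost line for line into code. The following Python sketch uses five invented data points and a hypothetical helper named `pearson_r`; it also checks the symmetry property by swapping the two variables.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from the sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # swapping the variables gives the same value
print(round(r_xy, 4), round(r_yx, 4))
```

For these values $S_{XY} = 6$, $S_{XX} = 10$ and $S_{YY} = 6$, so $r = 6/\sqrt{60} \approx 0.77$.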


Pitfalls

- Correlation is not causality
- Absence of correlation is not absence of relationship: the relationship may be non-linear


Non-linear!

*Figure 8:* Fictional data displaying a strong *non-linear* relationship and a near-zero correlation. [Scatterplot: x-axis from -3 to 3, y-axis from -2 to 12; plot legend: t, t**2+invnorm(rand(1)).]
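The pattern in Figure 8 can be reproduced in a simplified form. This Python sketch uses an exact parabola, $Y = X^2$, on invented X values that are symmetric around zero; unlike the figure it adds no random noise, so the result is reproducible. Y is completely determined by X, yet the products of deviations cancel and $S_{XY}$, the numerator of $r$, is zero.

```python
# Hypothetical data: a perfect but non-linear (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Products of X and Y deviations: positive on the right arm of the
# parabola, negative on the left arm, so they cancel out
products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
sxy = sum(products)

print(products)
print(sxy)  # zero numerator, hence zero correlation
```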


Estimating correlations in Stata

    . pwcorr ofimn ojbhgs ojbhrs, sig

                 |    ofimn   ojbhgs   ojbhrs
    -------------+---------------------------
           ofimn |   1.0000
                 |
          ojbhgs |   0.4851   1.0000
                 |   0.0000
                 |
          ojbhrs |   0.4245   0.2489   1.0000
                 |   0.0000   0.0000


Viewing the same correlations


Unit 16: Regression



Regression analysis

- Regression Analysis: Fitting the “best” line through the scatter
- Very closely related to correlation, but treats one variable as dependent and the other(s) as explanatory, whereas correlation is symmetric


Some geometry: equation of a line

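$$Y = a + bX$$

[Plot of the line $Y = 0.7 + 0.3X$, with X variable on the x-axis (0 to 3) and Y variable on the y-axis (0 to 2). Two points on the line are marked: X = 0, Y = 0.7 and X = 1, Y = 1.0. Here the intercept is a = 0.7 and the slope is b = 0.3.]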


Predictive in intent

- Asymmetric: use X to predict Y
- PRE: implied causality (Proportional Reduction in Error: if we know X, we can guess Y better)
- Find the best a and b to summarise the data scatter:
  - ‘Best’ is defined as minimising the sum of squared deviations between the observed data points and the fitted line, hence often called ‘least-squares’ regression
  - Deviations are the vertical distances between the line and the observed data points
  - Very similar logic to the mean (which minimises variance)


Predicted values

The line gives a predicted value of Y for each value of X:

$$\hat{Y} = a + bX$$

$$Y = \hat{Y} + e$$

$$Y = a + bX + e$$

e is the ‘residual’ or deviation.

That is, knowing X we “predict” or guess Y as a + bX

In general this is more accurate than guessing Y as $\bar{Y}$, the mean: “Proportionate Reduction in Error”
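As a small worked illustration, take the line $Y = 0.7 + 0.3X$ from the earlier slide on the equation of a line, together with a hypothetical observation $(X, Y) = (2, 1.5)$ invented for this example:

$$\hat{Y} = 0.7 + 0.3 \times 2 = 1.3, \qquad e = Y - \hat{Y} = 1.5 - 1.3 = 0.2$$

The observed value lies 0.2 above the line, and that vertical gap is the residual.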


Deviations from the line



Regression equation

Regression equation: the estimate of Y, called $\hat{Y}$, depends on X:

$$\hat{Y} = a + bX$$

The regression slope b depends on $S_{XY}$ and $S_{XX}$; the intercept a is calculated from b and the mean values of Y and X:

$$b = \frac{S_{XY}}{S_{XX}}$$

$$a = \bar{Y} - b\bar{X}$$

$$S_{XX} = \sum (X_i - \bar{X})^2$$

$$S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y})$$
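The slope and intercept formulas can be sketched in a few lines of Python. The data points and the function name `fit_line` are invented for illustration; the code then uses the fitted line to compute predicted values and residuals.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y-hat = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, invented for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

print(f"a={a:.2f} b={b:.2f}")
print([round(e, 2) for e in residuals])
```

Because the intercept is defined as $a = \bar{Y} - b\bar{X}$, the fitted line passes through the point of means and the residuals sum to zero (up to rounding).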


Pitfalls

- As with correlation, non-linear relationships may be missed
- Spurious relationships will fit just as well as real ones (e.g., if A affects B and A affects C, B and C will seem to be related and a regression line might fit well)
- Predicting outside the range of the data: the relationship we see only holds for the data we use, and it may well not hold for higher (or lower) values of X and Y


Fit

How well does it “fit”? We use $R^2$ to tell:

- ranges from 0: no linear relationship at all
- to 1: perfect relationship, all Ys are exactly equal to a + bX
- values from 0.7 up indicate quite a good relationship
- smaller values may still indicate an interesting relationship

In the case of bivariate regression (one independent variable), $R^2$ is the same as r × r (squared correlation coefficient).
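One way to see what $R^2$ measures is as a proportionate reduction in error: compare the squared error from guessing the mean for every case with the squared error from guessing a + bX. The Python sketch below does this on invented data and checks that, for a bivariate regression, the result equals the squared correlation coefficient. All variable names are illustrative.

```python
# Hypothetical data, invented for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x

# Squared error when we guess the mean of Y for every case
error_mean = syy
# Squared error when we guess a + bX instead
error_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (error_mean - error_line) / error_mean
r = sxy / (sxx * syy) ** 0.5

print(round(r_squared, 4), round(r * r, 4))
```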


Regression in Stata:

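    . reg ofimn ojbhrs

          Source |       SS           df       MS      Number of obs   =     7,945
    -------------+----------------------------------   F(1, 7943)      =   1398.95
           Model |  1.7000e+09         1  1.7000e+09   Prob > F        =    0.0000
        Residual |  9.6522e+09     7,943   1215179.2   R-squared       =    0.1497
    -------------+----------------------------------   Adj R-squared   =    0.1496
           Total |  1.1352e+10     7,944  1429021.17   Root MSE        =    1102.4

    ------------------------------------------------------------------------------
           ofimn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          ojbhrs |   39.34202   1.051854    37.40   0.000     37.28011    41.40393
           _cons |   434.7389    36.8029    11.81   0.000     362.5955    506.8822
    ------------------------------------------------------------------------------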
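Several figures in this output can be recovered from the others, which helps in reading it. The Python sketch below copies the numbers from the Stata output above. It uses 1.96 as an approximate multiplier for the confidence interval (Stata uses the exact t-distribution value, so the last digits differ slightly), and the 40 job hours used for the prediction is a hypothetical value.

```python
# Figures copied from the Stata output above
residual_ss = 9.6522e+09
total_ss = 1.1352e+10
coef_hours = 39.34202
se_hours = 1.051854
cons = 434.7389

# R-squared: share of the total sum of squares not left in the residuals
r_squared = 1 - residual_ss / total_ss

# t statistic: coefficient divided by its standard error
t_hours = coef_hours / se_hours

# Approximate 95% confidence interval for the slope (1.96 is an approximation)
ci_lower = coef_hours - 1.96 * se_hours
ci_upper = coef_hours + 1.96 * se_hours

# Predicted income for a hypothetical case with 40 job hours
predicted_40 = cons + coef_hours * 40

print(round(r_squared, 4), round(t_hours, 2))
print(round(ci_lower, 2), round(ci_upper, 2), round(predicted_40, 2))
```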


Predicted regression line

Multiple Regression

Multiple explanatory variables

- Regression analysis can be extended to the case where there is more than one explanatory variable: multivariate regression
- This allows us to estimate the net simultaneous effect of many variables, and thus to begin to disentangle more complex relationships
- Interpretation is relatively easy: each variable gets its own slope coefficient, standard error and significance
- The slope coefficient is the effect on the dependent variable of a 1 unit change in the explanatory variable, while taking account of the other variables


Example

- Example: domestic work time may be affected by gender, and also by paid work time. These are competing explanations: one or the other, or both, could have effects
- We can fit bivariate regressions:

$$\text{DWT} = a + b \times \text{PaidWork}$$

or

$$\text{DWT} = a + b \times \text{Female}$$

- We can also fit a single multivariate regression:

$$\text{DWT} = a + b \times \text{PaidWork} + c \times \text{Female}$$


Dichotomous variables

- We deal with gender in a special way: this is a binary or dichotomous variable, one that has two values
- We turn it into a yes/no or 0/1 variable, e.g., female or not
- If we put this in as an explanatory variable, a one unit change in the explanatory variable is the difference between being male and female
- Thus the c coefficient we get in the DWT = a + b × PaidWork + c × Female regression is the net change in predicted domestic work time for females, once you take account of paid work time.
- The b coefficient is then the net effect of a unit change in paid work time, once you take gender into account.


Sex only predicting income

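    . reg ofimn i.osex

          Source |       SS           df       MS      Number of obs   =     7,945
    -------------+----------------------------------   F(1, 7943)      =    606.72
           Model |   805586626         1   805586626   Prob > F        =    0.0000
           Total |  1.1352e+10     7,944  1429021.17   Root MSE        =    1152.3

    ------------------------------------------------------------------------------
           ofimn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            osex |
         female  |  -637.3352   25.87467   -24.63   0.000    -688.0563    -586.614
           _cons |   2062.275   18.64855   110.59   0.000     2025.719    2098.831
    ------------------------------------------------------------------------------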


Sex and job hours predicting income

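    . reg ofimn ojbhrs i.osex

          Source |       SS           df       MS      Number of obs   =     7,945
    -------------+----------------------------------   F(2, 7942)      =    794.96
           Model |  1.8935e+09         2   946761687   Prob > F        =    0.0000
           Total |  1.1352e+10     7,944  1429021.17   Root MSE        =    1091.3

    ------------------------------------------------------------------------------
           ofimn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          ojbhrs |   33.96065   1.123629    30.22   0.000     31.75804    36.16326
            osex |
         female  |  -337.0889   26.44232   -12.75   0.000    -388.9228    -285.255
           _cons |   787.1759   45.73595    17.21   0.000     697.5214    876.8304
    ------------------------------------------------------------------------------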
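The two Stata outputs above can be compared by plugging values into the fitted equations. In this Python sketch the coefficients are copied from those outputs; the function names and the choice of 40 job hours are hypothetical. It shows how the predicted female/male gap in income changes once job hours are taken into account.

```python
# Coefficients copied from the two Stata outputs above
SEX_ONLY = {"cons": 2062.275, "female": -637.3352}
SEX_AND_HOURS = {"cons": 787.1759, "hours": 33.96065, "female": -337.0889}


def predict_sex_only(female):
    """Predicted income from the sex-only model (female is 0 or 1)."""
    return SEX_ONLY["cons"] + SEX_ONLY["female"] * female


def predict_sex_and_hours(hours, female):
    """Predicted income from the model with job hours and sex."""
    return (SEX_AND_HOURS["cons"]
            + SEX_AND_HOURS["hours"] * hours
            + SEX_AND_HOURS["female"] * female)


hours = 40  # hypothetical number of job hours, the same for both sexes

# Gap ignoring hours: the coefficient on the 0/1 variable
gap_raw = predict_sex_only(1) - predict_sex_only(0)
# Gap at equal hours: the net difference once hours are taken into account
gap_net = predict_sex_and_hours(hours, 1) - predict_sex_and_hours(hours, 0)

print(round(predict_sex_only(0), 2), round(predict_sex_only(1), 2))
print(round(gap_raw, 2), round(gap_net, 2))
```

Whatever value of `hours` is chosen, the net gap equals the coefficient on the 0/1 variable, because the hours term cancels when the two predictions are subtracted.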


Sex and hours
