SO5041: Quantitative Research Methodsteaching.sociology.ul.ie/so5041/slides8.pdf · SO5041:...
Transcript of SO5041: Quantitative Research Methodsteaching.sociology.ul.ie/so5041/slides8.pdf · SO5041:...
SO5041: Quantitative Research Methods
SO5041: Quantitative Research Methods
Brendan Halpin, Sociology
Autumn 2019/0
1
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Unit 0: Course Outline
2
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Course outline
Lecture: CG058 Mon 10h00-12h00PC Lab: A00600a Mon 13h00-15h00Lecturer: Dr Brendan Halpin
[email protected] 3147Office hours: Tuesday 14h00–17h00, F1-007 (or e-mail for appointment)
2
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Course focus:
The role of empirical reasoning in sociology, especially using quantitative dataQuantitative social science data collection, especially the surveyHandling quantitative data: coding it onto a computer, organising it, presenting itStatistical analysis: making claims about the world using quantitative data –sampling and inference
3
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Practical focus:
Using software to analyse data and prepare findingsStata: statistical software packageMicrosoft Excel: spreadsheet
Carrying out questionnaire-based researchBecoming a critical consumer of quantitative research
4
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Planned outline
Introduction to quantitative method – using number to represent information,simple descriptive statistics and presentations. Using Stata to enter data andreport itSamples, surveys and probability – theory of how a sample can be used to describea population, elementary probability theory, basic questionnaire design and surveyimplementation. Manipulating data in Stata, more presentation.Statistical inference – the practice of using a sample to describe a population.Statistically informed use of Stata: testing difference of means, analysingassociation in tables.Linear regression and correlationRegression analysis with multiple explanatory variables?In parallel, reading and discussion of a small number of quantitative researchreports
5
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Assessment
Assessment: three assignments (due ends of weeks 5, 9 and 13) totalling 60%Final exam 40%Repeat assessment: All assignments plus repeat exam August 2020
6
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Detailed lecture sequence
Week 1 General introduction
Week 2 (1) Introducing Quantitative Social Research, (2) Surveys, Questionnaires and Sampling
Week 3 (3) Numbers as Information – Data, (4) Bivariate analyses
Week 4 (4) Bivariate analyses continuted, (5) More on types of variables
Week 5 (6) Sampling, (7) Distributions
Week 6 (8) More on Distributions, (9) Sampling Distributions and the Central Limit Theorem
Week 7 (10) Sampling and confidence intervals, (11) Two new distributions: t and χ2
Week 8 Bank Holiday
Week 9 (12) Questionnaire Design
Week 10 (13) Hypothesis testing, (14) More $t$-tests,
Week 11 (15) Correlation, (16) Regression
Week 12 (16) Regression continued, (17) Review
7
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Lab sequence (subject to change)
Week 1 Logging on, running Stata, general intro (short lab)
Week 2 Simple data entry, plus some univariate analysis
Week 3 Bivariate analysis
Week 4 Editing data in Stata, missing values
Week 5 Using the Normal Distribution (by hand, spreadsheet)
Week 6 Confidence intervals for means and proportions (by hand, Stata)
Week 7 Understanding the standard deviation (spreadsheet exercises)
Week 8 Bank Holiday
Week 9 Chi-squared test (spreadsheet and Stata)
Week 10 Hypothesis test on a mean (spreadsheet and Stata)
Week 11 More hypothesis tests; correlation
Week 12 Regression analysis using Stata
8
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Note on assignments
Cooperation between students is encouraged but assignments must be thestudent’s own workPlease refer to Dept Plagiarism Policy at http://www.ul.ie/sociology/ (under“Student Resources”)Please use the Dept Assignment Declaration form with all assignmentshttp://www.ul.ie/sociology/ (under “Student Resources”)Note the Dept’s policy on deadlines http://www.ul.ie/sociology/ (under“Student Resources”)
9
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Texts
Main text: Agresti and Finlay, Statistical Methods for the Social Sciences – goodintroduction to formal statistical methods, very clear and accessible (will use moreextensively in second semester course)For Stata:
Robson and Pevalin, The Stata Survival ManualKohler and Kreuter, Data Analysis using StataAcock, A Gentle Introduction to Stata
Dip into Alan Bryman, Social Research MethodsAlso David de Vaus, Surveys in Social Research: good on survey methodologyOther occasional readings
10
SO5041: Quantitative Research MethodsUnit 0: Course OutlineCourse outline
Some resources on the web (local)
Examples and data to download for lab sessionsAll accessible from http://teaching.sociology.ul.ie/so5041/
Course SULIS site (especially for assignments, see also course outline)
11
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Unit 1: Introducing Quantitative Social Research
12
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Qual versus Quant?
Social research is often divided into qualitative and quantitative:Is this a real division?What exactly is quantitative social research?
In some respects the distinction is clear:Quantitative research is concerned with numbers and statistical analysis, typically oflarge-scale surveysQualitative research is less obviously concerned with number, and typically is smallerscale and more in-depth
‘Qualitative’ actually covers a diverse range of research and methodBut sometimes this division is over-stated, e.g., “qualitative sociology” versus“quantitative sociology”It is only at the level of method that the division is clear
12
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Positivism–Interpretivism?
Some commentators see major philosophical divisionsquantitative ≡ positivistqualitative ≡ interpretivist, anti-positivist
While there are some parallels, this is misleading: quantitative research is notnecessarily positivist, and is not incompatible with interpretivist frameworksPositivism is a philosophy of science that holds that only that which can bemeasured existsHowever, it is often used to mean a naïvely scientific approach to social research,or as an inherently negative way of referring to the quantitative method
13
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Positivism–Interpretivism?
In the past positivists have argued that social science should use exactly the samemethods as the natural sciences,
suggesting that social phenomena should be studied via measurement andobservation from the outside,strongly downplaying the role of understanding or of rational social action
14
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Critique of positivism
Positivism has many critics, in sociology often from an interpretivist perspective:the interpretation of the meaning of action and communication is paramount forthis sort of social researchSome attacks on positivism have been extended to attacks on quantitativemethods, but it is mistaken to treat them as the sameMany interpretivists criticise the quantitative method for not being able to dealwith the complexity of meaning and context in social life – this is not the same asattacking it for being positivist
15
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Quantitative approaches and rationality
A great deal of quantitative research works in terms of knowledgeable rationalactors
Weber (key theorist for interpretivists) himself carried out a number of quantiativeprojectsMuch neo-Weberian sociology uses quantitative method, e.g., “rational actiontheory” school
Moreover, to justify quantitative sociology it is sufficient that you can say someuseful things via “measurement”, and conversely, all naturalistic research is in asense dealing with measurement and observationIncreasing use of "mixed methods": both qual and quant in the same project
16
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Limitations
Though quantitative research is not necessarily positivist, it has clear limitationsRigid, almost industrial method – less opportunity for a dialogue between theory anddata collection during the research processRestrictive: you can only deal with the data you haveWeak on rare or hard-to-trace populations, or on rare eventsInstitutional context: large investment required means substantial pressure to followthe pressures of funding
17
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social Research
Powerful
But it is very powerful for certain types of questionwhether certain processes or phenomena are present in a given populationwhere the relative size of competing effects is importantindeed, wherever it is necessary to generalise to a large population
Probably true to say that the best research is methodologically ecumenical
18
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social ResearchAn Anatomy of Quantitative Research
“Scientific” model?
Though not necessarily positivist, quantitative research has strong affinities with a“scientific” model of the research process
the Theory ⇒ Hypothesis ⇒ Test cycleFalsification in the Popperian sensean interest in causal relations
19
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social ResearchAn Anatomy of Quantitative Research
The experiment
The ideal type of this method is the experiment as, at least conceptuallysubjects allocated randomly to two groups,one control groupone “treatment” group exposed to the variable of interestwith the strong inference that differences in outcome between the two groups arecaused by the treatment
Experiments are, of course, almost always impossible in social research (and muchother research)
20
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social ResearchAn Anatomy of Quantitative Research
Observational data
The alternative is the observational model:collect data “in the field”, e.g., using surveys or “observation”use multi-variable statistical techniques to search for causal relations
Necessarily weaker than the experimental model: much harder to be sure you arenot missing another reason for a given effectGenerally harder to reason about causality with observational data: the evidenceyou have is simply association or correlation
21
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social ResearchAn Anatomy of Quantitative Research
Ambiguity
That is, your survey may show that unemployed people have lower mental healththan other categories (an “association” between employment status and mentalhealth), but cannot tell you why:
unemployment might simply be bad for youpeople with mental health problems may be more prone to lose jobsor a third factor may affect both (e.g., geography: in one region there may be higherunemployment and higher mental ill-health for unrelated reasons)
22
SO5041: Quantitative Research MethodsUnit 1: Introducing Quantitative Social ResearchAn Anatomy of Quantitative Research
Taking time into account
Longitudinal observational designs helpif respondents are observed at two time points, and we systematically see peoplemoving from employment to unemployment and equanimity to depression (and viceversa), we can exclude the “third factor” argumentif we observe people more frequently, we may be able to observe the onset onunemployment before the onset of depression at an individual (or vice versa) andargue with more confidence for a causal relationship
23
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and Sampling
Unit 2: Surveys, Questionnaires and Sampling
24
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and Sampling
Surveys and Survey Research
Social survey research is very widespread:political opinion polls and market researchEU-wide Labour Force Survey & CSO’s Quarterly National Household Survey (LFSplus)EU EurobarometerEuropean Social SurveyInternational Social Survey ProgrammeHousehold Budget SurveyUK Family Expenditure Survey, General Household SurveyGrowing up in Ireland, TILDASlán, Irish Study of Sexual Health and Relationshipssurveys of business opinion, of inventories etc.many emanating from ESRI or government
24
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and Sampling
Surveys: representativity
The key principle of survey research is representativity: because the sample israndom, summaries of the sample’s characteristics can be imputed to the relevantpopulationSometimes we end up with too few cases of a subgroup to analyse – e.g., ethnicminorities; over-sampling or specially targetted surveys may help
25
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and Sampling
Longitudinal surveys
Longitudinal surveys are a special casePanel surveys the same sample at regular intervals (e.g., European CommunityHousehold Panel, US Panel Study of Income Dynamics, German Socio-EconomicPanel, British ‘Understanding Society’ Panel Study)Retrospective studies ask respondents to report complete life histories retrospectively(Irish Mobility Study, UK Family and Working Lives Survey, German Life HistoryStudy, etc.)Cohort studies take a group of subjects and follow them forward (e.g., the GrowingUp in Ireland study, The Irish Longitudinal Study on Ageing)
Taking time into account makes these in many ways much richer data sources
26
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and Sampling
Questionnaire design
The questionnaire is the linchpin of the surveyMust elicit right information with minimum of ambiguity or suggestion, minimuminconvenience to the respondentQuestion design is a black art, since small changes of phrasing may cause differentresultsExtensive reliance on standardised questions, or standardised forms of questions(e.g., the typical five point answer scale: strongly agree, agree, neutral, disagree,strongly disagree)Standard schedules exist for certain purposes, e.g., the General HealthQuestionnaireV important to minimise “open” questions: much cheaper to pre-code answers (butallow an “other, specify” answer)Very important to test questionnaires in a pilot survey, to trap ambiguities andother problems, and to help pre-code questions
27
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and SamplingInference
Inference, sampling and statistics
The basis of survey research is inference: the projection of the characteristics ofthe sample onto the populationThis requires a random (or quasi-random) sample: each member of the populationof interest has an equal chance of being selectedConsider a simple summary: the mean
x =
∑xin
28
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and SamplingInference
Calculations on samples
We could calculate the mean for the entire population but that would be expensiveWe can calculate the mean for a random sample: much cheaper and the answerapproximates the true answer in a probabilistic wayThat is, if we were to draw a large number of samples and calculate all the samplemeans, these sample means would be distributed about the true (population) valuewith an approximately normal distributionIn only drawing one sample, we infer that the true mean lies probabilistically aroundthe sample mean, with a large sample giving a tighter distribution that a small one
29
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and SamplingInference
Error, but maybe not too much
We can also calculate from the data how wide this probability distribution is: thisis the basis for “significance”, a measure of how much the sample summary can betrustedIt is usually the case that a sample a tiny fraction the size of the population willprovide good estimatesThe same reasoning applies to means, frequency distributions, correlationcoefficients, cross-tabulations, regression coefficients etc.
30
SO5041: Quantitative Research MethodsUnit 2: Surveys, Questionnaires and SamplingInference
Multi-variable techniques
Multi-variable techniques are the key to causal analysis: take account of the effectsof many variables simultaneously to estimate the net effect of eachRegression analysis is the most commonly used multi-variable technique, and issimilar in principle to most multi-variable techniques - It is used with a dependentvariable that is continuous (i.e., like age, time, income etc., rather than religion,sex, nationality which are categorical)
31
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataProcessing information numerically
Unit 3: Numbers as Information – Data
32
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataProcessing information numerically
Number as information
In the Lab we will begin by coding data in numerical form, entering it on thecomputer and using Stata to make simple descriptive summaries of itThis activity is an important stage of the quantitative research process: turninginformation into numbers and numbers into information
32
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataProcessing information numerically
Steps
There are several important steps in this processDetermine a clear set of information items to collect (i.e., what questions you wantto ask)Determine a simple way of representing this information
Represent it as a single “quantitative” number, e.g., Euros per annum for income, orDetermine a fixed set of categories grouping every possible answer, and attach anumber to each category
Collect it (easier said than done) and enter on a computerBut for the analysis to make sense we need to bring in information about that thenumbers mean: variable labels and value labels restore some part of the meaningfulinformation hidden by coding as numbers
33
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Descriptive summaries
When we enter data on a computer we can easily present descriptive univariatesummaries (i.e., summarising one variable at a time)For categorical or grouped variables (i.e., where there is a “small” number ofdifferent values) we can use a frequency table: how many of each sort are there?what proportion of each sort are there?For variables with a “quantitative” interpretation (i.e., where the number measuressomething like age, income, duration, distance, or where it is a count like numberof children, number of cigarettes per day) we can explore the mean or average, orperhaps the median
34
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Frequency Table
. tab rjbstatcurrent economic
activity Freq. Percent Cum.
self-employed 929 6.82 6.82employed 6,749 49.58 56.41
unemployed 468 3.44 59.84retired 3,034 22.29 82.13
maternity leave 75 0.55 82.68family care 786 5.77 88.46
ft studt, school 878 6.45 94.91lt sick, disabld 608 4.47 99.38gvt trng scheme 22 0.16 99.54
other 63 0.46 100.00
Total 13,612 100.00
35
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Summarising a “quantitative” variable
. su rfimnVariable Obs Mean Std. Dev. Min Max
rfimn 13,612 1454.687 1459.462 0 56916.67
36
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Codebook
. codebook rfimn
rfimn total income: last month
type: numeric (double)label: rfimn, but 9658 nonmissing values are not labeledrange: [0,56916.668] units: 1.000e-08
unique values: 9,658 missing .: 0/13,612examples: 470.71835
925.075011401.99292175.7588
37
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Mean, Median
The mean is defined as the sum total of the variable, divided by the number ofcases:
x =
∑xin
The median is defined as the value such that 50% of cases are lower and 50% ofcases are higher
38
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Calculating the median
To calculate the median “by hand”:sort the values in orderif there is an odd number of cases (e.g., 101), find the middle one (e.g., 51st) – itsvalue is the medianif there is an even number of cases (e.g., 100), find the middle pair (e.g., 50th and51st) – the median is half way between their values
39
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate summaries
Median via Stata
. centile rfimnBinom. Interp.
Variable Obs Percentile Centile [95% Conf. Interval]
rfimn 13,612 50 1141.668 1119.325 1166.667
40
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Graphs
We can also make graphical summariesFor categorical variables the bar-chart is very good: this is like a frequency tablewhere the height of the bars is proportional to the number/proportion in thatcategoryWe can also use Pie-charts: these are circles with segments (pie-slices) whoseangle is proportional to the size of the category (there are some arguments thatbar-charts are better)For quantitative variables we can use histograms: these are very like bar-charts,but break the “continuous” variable into groups. The bars in a histogram touch, toshow that the variable is really continuous. In a histogram, the area under the“curve” (or stepped line) for a given range is proportional to the numbers in thatrange. It is sometimes interesting to vary the group size (the “bin size”) to getmore or less detail
41
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Bar chart
graph hbar, over(rjbstat)
42
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Zero is important
43
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Beware: People lie with barcharts
44
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Pie chart
45
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Pie is bad for you
Source: http://junkcharts.typepad.com/junk_charts/2009/08/community-outreach-piemaking.html
46
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Histograms
hist rfimn if rfimn<=16000
47
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Different bin widths
hist rfimn if rfimn<=16000, bin(10)hist rfimn if rfimn<=16000, bin(100)
48
SO5041: Quantitative Research MethodsUnit 3: Numbers as Information – DataUnivariate graphs
Some interesting reading on graphics
William S Cleveland The Elements of Graphing Data. Rev. ed. Murray Hill, N.J.:AT&T Bell Laboratories, 1994Edward Tufte, The Visual Display of Quantitative Information. Cheshire, Conn.:Graphics Press, 1983Kieran Healy, Data Visualization: A Practical Introduction 2018. Focused on R,but Chapter 1 is a general discussion. Online at https://socviz.co/
49
SO5041: Quantitative Research MethodsUnit 4: Bivariate analyses
Unit 4: Bivariate analyses
50
SO5041: Quantitative Research MethodsUnit 4: Bivariate analyses
Types of variables
So far i have been using several terms for types of variable:CategoricalGroupedContinuousQuantitative
A more formal classification that corresponds with the sorts of analysis possible,goes:
NominalOrdinalIntervalRatio
50
SO5041: Quantitative Research MethodsUnit 4: Bivariate analyses
NOIR – categorical
Nominal variables have categories where each category is “just a name”, i.e.,nominal: example: religion, region of birth, political allegienceWith nominal variables we can’t do much more than present frequenciesOrdinal variables have categories where each category is something different, butwhere it is possible to put the categories in a meaningful high–low order. Examplesinclude highest maths qualification, exam letter grades, attitudes (agree strongly,agree, neutral, disagree, disagree strongly) and so on. with ordinal variables,frequencies are meaningful, and the “cumulative percent” column that Stataprovides is also meaningful. Additionally, with ordinal variables we can calculatethe median
51
SO5041: Quantitative Research MethodsUnit 4: Bivariate analyses
NOIR – quantitative
Quantitative variables can be interval or ratioInterval variables are variables like temperature, where the difference between, say,35° and 40° is the same as between 50° and 55°. That is, the meaning of aninterval is the same no matter where it is. However, with temperature (centigradeor fahrenheit) it is not true that 40° is twice as hot as 20°, because 0° is not a truestarting point. With interval variables it makes sense to calculate the mean. Thatis, if we have five days with temperatures of 11°, 9°, 12°, 11° and 10°, it makessense to say the average is 11+9+12+11+10
5 = 10.6
52
SO5041: Quantitative Research MethodsUnit 4: Bivariate analyses
NOIR – ratio variables
ratio variables are like interval variables, and are probably more commonlyencountered. these are quantitative variables, where the zero makes sense.distance, duration, money, age are all ratio variables, because 0 kilometres, 0seconds, €0 and 0 years are all things that make sense. ratio variables can also usemean, but unlike interval variables we can also work with proportions (twice as old,twice as rich, 10% farther and so on)
53
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Recap: Univariate analysis
Frequencies: tab foreign
Pie chart: graph pie, over(foreign)
Bar chart: graph bar, over(foreign)
Medians, etc: centile mpg, centile(25 50 75)
Mean, standard deviation: su mpg
Histogram: histogram mpg
54
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Bivariate Analysis
So far we have dealt with describing one variable at a time: important but limitedBi-variate (two-variable) summaries allow us to look at relationships betweenvariables as wellWhere both variables are categorical (nominal, ordinal or grouped) we can usetwo-way tables, cross-tabulations
55
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Cross-tabulation
. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta
. tab empstat sexusual employment school leavers sex
situation 1 2 Total
working for payment 388 380 768unemployed 67 46 113
looking for 1st job 170 151 321student 471 490 961
engaged in home dutie 0 19 19permanent disability/ 4 0 4
other 4 7 11
Total 1,104 1,093 2,197
56
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Column Percentages. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta. tab empstat sex, col
Key
frequencycolumn percentage
usual employment school leavers sexsituation 1 2 Total
working for payment 388 380 76835.14 34.77 34.96
unemployed 67 46 1136.07 4.21 5.14
looking for 1st job 170 151 32115.40 13.82 14.61
student 471 490 96142.66 44.83 43.74
engaged in home dutie 0 19 190.00 1.74 0.86
permanent disability/ 4 0 40.36 0.00 0.18
other 4 7 110.36 0.64 0.50
Total 1,104 1,093 2,197100.00 100.00 100.00
57
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Too many percents!. use http://teaching.sociology.ul.ie/so5041/labs/week3.dta. tab empstat sex, row col
Key
frequencyrow percentage
column percentage
usual employment school leavers sexsituation 1 2 Total
working for payment 388 380 76850.52 49.48 100.0035.14 34.77 34.96
unemployed 67 46 11359.29 40.71 100.006.07 4.21 5.14
looking for 1st job 170 151 32152.96 47.04 100.0015.40 13.82 14.61
student 471 490 96149.01 50.99 100.0042.66 44.83 43.74
engaged in home dutie 0 19 190.00 100.00 100.000.00 1.74 0.86
permanent disability/ 4 0 4100.00 0.00 100.00
0.36 0.00 0.18
other 4 7 1136.36 63.64 100.000.36 0.64 0.50
Total 1,104 1,093 2,19750.25 49.75 100.00
100.00 100.00 100.00 58
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate Summaries
Ordinal variable. tab mopfama mopfamb, row
Key
frequencyrow percentage
pre-school childsuffers if mother family suffers if mother works full-time
works strongly agree neithr ag disagree strongly Total
strongly agree 572 328 79 32 21 1,03255.43 31.78 7.66 3.10 2.03 100.00
agree 267 2,507 876 354 37 4,0416.61 62.04 21.68 8.76 0.92 100.00
neithr agree, disagre 55 813 3,070 1,010 95 5,0431.09 16.12 60.88 20.03 1.88 100.00
disagree 32 295 437 2,777 344 3,8850.82 7.59 11.25 71.48 8.85 100.00
strongly disagree 6 23 40 164 680 9130.66 2.52 4.38 17.96 74.48 100.00
Total 932 3,966 4,502 4,337 1,177 14,9146.25 26.59 30.19 29.08 7.89 100.00
59
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Bivariate graphs
Graphical ways of presenting the same information as a cross-tabulation includeclustered and stacked bar charts:Clusters are good for looking at distributions within groupsStacked bars are good where it is important to emphasise the sizes of the groups
60
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Clustered bar chart
0 100 200 300 400 500Number of respondents
other
student
looking for 1st job
unemployed
working for payment
Male Female
61
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Stacked bar chart
0 200 400 600 800 1,000Number of respondents
other
student
looking for 1st job
unemployed
working for payment
Male Female
62
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Compare means – continuous within categorical
Where one of the variables is categorical and the other is continuous (interval orratio) we can “compare means”. That is, instead of calculating mean income, wecan calculate mean income for different groups
63
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Wage by occupational group
. sysuse nlsw88, clear(NLSW, 1988 extract). tab occupation, su(wage)
Summary of hourly wageoccupation Mean Std. Dev. Freq.
Professio 10.723624 6.3510736 317Managers/ 10.899784 7.5215875 264
Sales 7.1544893 5.0427568 726Clerical/ 8.5166856 8.5653995 102Craftsmen 7.152988 3.7629626 53Operative 5.6538061 3.3861932 246Transport 3.2004937 1.3209668 28Laborers 4.9058895 3.6083499 286Farmers 8.0515299 0 1
Farm labo 3.0837354 .76638443 9Service 5.9887432 2.1399333 16
Household 6.3888881 .31313052 2Other 8.8362936 4.128914 187
Total 7.7779283 5.7613599 2,237
64
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Another use of bar charts
0 2 4 6 8 10mean of wage
Other
Household workers
Service
Farm laborers
Farmers
Laborers
Transport
Operatives
Craftsmen
Clerical/unskilled
Sales
Managers/admin
Professional/technical
65
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Boxplot
0 10 20 30 40hourly wage
Other
Household workers
Service
Farm laborers
Farmers
Laborers
Transport
Operatives
Craftsmen
Clerical/unskilled
Sales
Managers/admin
Professional/technical
66
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Boxplot
The boxplot summarises the distribution of a variable:50% of the distribution is within the boxThe bar in the middle is the medianThe “whiskers” extend to the minimum and maximum, excluding “outliers”Outliers are cases more than 1.5 IQR above or below the corresponding quartile(these are often marked separately)
67
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Three-way analysis
0 2 4 6 8 10mean of wage
Other
Household workers
Service
Farm laborers
Farmers
Laborers
Transport
Operatives
Craftsmen
Clerical/unskilled
Sales
Managers/admin
Professional/technical
nonunion union
68
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Scatterplot
Where both variables are continuous, a very useful device is the scatterplot:
02000
4000
6000
8000
tota
l in
com
e: la
st m
onth
0 20 40 60 80 100no. of hours normally worked per week ~~
0100
200
300
400
net pay last w
eek
0 100 200 300 400 500gross pay last week
69
SO5041: Quantitative Research MethodsUnit 4: Bivariate analysesBivariate graphs
Background Reading
Agresti and Finlay, Statistical Methods for the Social Sciences, chapter 1(background) and chapter 3 (descriptive statistics)To follow: A&F, chapter 2, Sampling and Measurement%, F&G
70
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
Unit 5: More on types of variables
71
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
Spread in quantitative variables
When we have quantitative variables like age,income, we use measures of central tendencylike mean and median to describe the “middle”However, the dispersion about the middle isalso important: completely differentdistributions can have the same mean:
A B C1 5.5 1.02 5.5 4.03 5.5 4.54 5.5 5.25 5.5 5.36 5.5 5.67 5.5 5.98 5.5 6.09 5.5 7.5
10 5.5 10.0-------------
Mean 5.5 5.5 5.5
71
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
Spread in quantitative variables
0.2
.4.6
0 5 10 0 5 10 0 5 10
1 2 3
Density
Three distributions with same mean
72
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
Summarising spread
As well as summarising the middle, we often want to summarise the spread:RangeInter-quartile rangeStandard Deviation
The range is important, but depends too much on just the two extreme casesThe IQR is much more stableThe standard deviation has useful statistical properties
73
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
The “Standard Deviation”
The standard deviation is calculated from the “deviations” from the mean, X :
Deviation = Xi − X
The deviations add to zero so we can’t just add them up: instead square them andthen add them up: ∑
(Xi − X )2
To get a sort of average, we divide by sample size minus 1, and take the squareroot:
σ =
√∑(Xi − X )2
n − 1
74
SO5041: Quantitative Research MethodsUnit 5: More on types of variables
Standard Deviation: good measure of spread
The standard deviation indicates how spread out the data is: the more spread out,the bigger the StDevIt’s a good measure because it depends on every single case, not just the extremepairs (range) or the quartilesIt has useful statistical properties which help when we come to statistical inference
75
SO5041: Quantitative Research MethodsUnit 5: More on types of variablesMore on types of variables
More on types of variables
We have already divided variables into nominal/ordinal/interval/ratioWe can also differentiate between “qualitative” and “quantitative”:
Nominal variables are clearly qualitative: observations with different values havedifferent qualitiesInterval/ratio variables directly represent quantities: amount of money, number ofchildren, distance
76
SO5041: Quantitative Research MethodsUnit 5: More on types of variablesMore on types of variables
What about ordinal variables?
Where do ordinal variables (e.g., level of education) fit in?In between?Often treated as qualitative – categories are clearly “different”Sometimes treated as quantitative: e.g., letter-grades are given scores for calculatingQCA; “Likert” scale answers (strongly agree to strongly disagree) are given scores of-2, -1, 0, 1, 2Applying scores to an ordinal variable implies there is an underlying interval/ratiovariable which we do not measureIn the QCA example this is true: the individual exam marks are on a ratio scaleIn the likert-scale example, we may feel there is a continuum ofagreement–disagreement which is approximately measured by the five options
77
SO5041: Quantitative Research MethodsUnit 5: More on types of variablesMore on types of variables
Sensitivity
There, however, we may feel that the steps between neutral and (dis)agree arebigger than between (dis)agree and strongly (dis)agree so we might want thescores to go -3, -2, 0, 2, 3 (or -3, -1, 0, 1, 3)When we base an analysis on applied scores like these, it is good to do a“sensitivity analysis”: compare the results with different scoring systems to see howsensitive your analysis is to the scoring method
78
SO5041: Quantitative Research MethodsUnit 5: More on types of variablesMore on types of variables
Summary of 5-point ordinal scale
Question VP P Neutral G VG N ScoreOverall satisfaction with 1.4 1.7 23.2 52.6 21.1 289 0.781Department of SociologyNational reputation of 0.7 2.5 16.5 36.6 43.7 279 0.840University of LimerickOverall quality of academic 0.0 5.1 13.6 54.2 27.1 295 0.807programme within ULOverall quality of services 1.0 7.7 20.1 47.8 23.4 299 0.770provided by ULOverall quality of the social 0.7 3.2 23.6 36.6 35.9 284 0.808aspects at UL servicesQuality of Student Academic 3.9 12.5 33.9 37.1 12.5 280 0.684Administration officeQuality of Fees office 2.9 7.5 47.3 33.5 8.8 239 0.675
79
SO5041: Quantitative Research MethodsUnit 5: More on types of variablesMore on types of variables
Yet another distinction!
A new distinction: discrete versus continuousDiscrete variables have a finite number of possible values
Nominal and ordinal variables are by definition discreteCount variables are discrete: 0, 1, 2, 3 . . . – you can’t actually have 2.4 childrenGrouped interval variables are discrete
Continuous variables can have an infinite number of values in principle: 2, 2.4,2.43, 2.435, 2.4358 and so on.Ungrouped interval/ratio variables are continuous in principle: an infinite numberof values are possibleIn practice, we treat income as continuous though it can’t vary in amounts lessthan 1 cent (and age, though we often measure it in integer years, or to thenearest month, etc.)
80
SO5041: Quantitative Research MethodsUnit 6: Sampling
Unit 6: Sampling
81
SO5041: Quantitative Research MethodsUnit 6: Sampling
Sample not Census
Most quantitative research proceeds by using samples: rather than ask the entireelectorate every month or so who they would vote for, ask a random sampleIf these individuals are chosen at random, their responses will approximate those ofthe whole electorateRandom here does not mean completely without control, rather that eachindividual in the reference population has an equal chance of selection
81
SO5041: Quantitative Research MethodsUnit 6: Sampling
Simple Random Sampling
The basic sampling strategy is the simple random sampleevery subject in the population has the same chance of being selectedevery possible sample of that size has the same chance of being selected
How to select a random sample:Acquire a sampling frame: a list of all individuals in the reference populationNumber the individuals from 1 to N, the size of the populationDraw n (sample size) numbers from a random number table or a computer, in therange 1 to NInterview the corresponding individuals
82
SO5041: Quantitative Research MethodsUnit 6: Sampling
Random numbers
In the past random number tables were widely used:
Line/Col (1) (2) (3) (4)01 96252 48687 59641 9946002 71334 10218 07459 2033903 35932 10229 83514 7646104 33568 36397 92080 9843005 31535 29990 89596 7752906 16483 30849 18676 6322507 26374 60736 14522 6609608 12040 90130 91860 08280
Now computers do the job very conveniently: for instance in Excel typing =rand()in a cell gives you a random number between 0 and 1.
83
SO5041: Quantitative Research MethodsUnit 6: Sampling
Probability sampling
Simple random samples are an example of probability samplingProbability sample has the enormously important theoretical advantage ofrepresentativityIf properly conducted, the sample is guaranteed to be representatitiveThat is, all sample characteristics approximate those of the population – evencharacteristics you haven’t thought ofNon-probability sampling does not have these advantages but is sometimes usedfor other reasons
84
SO5041: Quantitative Research MethodsUnit 6: SamplingNon-random sampling
Non-probability sampling
There are many forms of non-probability sampling:volunteer sampling (self-selection)“streetcorner interview”quota sampling
Volunteer sampling is popular in the media: questionnaires in magazines, phone-inlines (vote for/against TV programme issue)Sometimes the sample is very largeHowever, because respondents are self-selecting serious bias can enter
85
SO5041: Quantitative Research MethodsUnit 6: SamplingNon-random sampling
Volunteer sample – dodgy inference!
Agresti/Finlay quote several examples (pp 20–21): e.g., TV phone vote on “shouldUN stay in US?” had a response of 186,000 with 67% voting for “No”A simultaneous random sample of 500 had 28% voting no: much smaller but farmore reliableTV phone-in subject to bias:
Who is watching the program? Time of day, interest, etc.Who is motivated to phone? Those with a strong opinion on the subject
Twitter polls a particularly bad example of volunteer samples ("retweet for a biggersample!")
86
SO5041: Quantitative Research MethodsUnit 6: SamplingNon-random sampling
“Opportunistic” samples
“Streetcorner interview” sampling simply interviews whoever comes along: randomin that senseHowever, where and when the interviewing is done will affect who is likely to beinterviewed: under-represent workers if during the working day, under-representcountry people if done in the city, over-represent shoppers if done in shopping area,etc.{}Quota sampling is a form of sampling that avoids these bias problems by usingquotas (of age group, sex, employment status, income group, region etc.) – peoplepassing by at “random” are interviewed only if they fit the quotaNot really probability sampling, but often a very effective and efficient way of“faking” a true random sample, especially for well-understood purposes like opinionpolling
87
SO5041: Quantitative Research MethodsUnit 6: SamplingCharacteristics of samples
Samples are representative
The characteristics of a random sample approximate those of the referencepopulation by virtue of randomisationFor instance, mean income in a sample (the “*sample statistic*”) will bemore-or-less close to the mean income for the population from which it is drawn(the “*population parameter*”)The sample statistic is a more-or-less good estimator of the population parameterThis is in general true for any statistic: proportion answering yes, distribution ofqualifications, average working hours, joint distributions (e.g., crosstabs) etc.{}
88
SO5041: Quantitative Research MethodsUnit 6: SamplingCharacteristics of samples
Samples are approximate
But because it comes from a sample, the sample statistic will not be an exactestimate of the population parameterThe sample statistic depends on chance: the exact contents of the sampleRepeating the exercise with a different sample will give a different sample statistic– still approximating the population parameter, of courseIn principle we can think of there being a distribution of possible sample statisticvalues, all falling more or less close to the true value of the population parameter
89
SO5041: Quantitative Research MethodsUnit 6: SamplingError in sampling
Sampling error
This difference between the statistic and the true value is called sampling errorIf the true proportion of UL students with part-time jobs is 65% and a samplereports 72% we have a sampling error of 72 – 65 = 7%How useful a sample is depends on the size of sampling error
The bigger the sample the smaller the sampling errorThe more variability in the population the larger the sampling error
How do we know how big the sampling error is? We don’t know the true value!With certain assumptions about the range of the true value and information aboutthe dispersion of the sample we can estimate it – e.g., with a sample of 1,000sampling error for proportions (%age working etc.) is around 3–4%
90
SO5041: Quantitative Research MethodsUnit 6: SamplingError in sampling
Systematic error
Sampling error is due to sampling variability: always presentOther sources of error can arise – systematic error is a problemError can arise from sampling problems
Undercoverage – e.g., homeless people, geographically mobile peopleBad sampling frame – e.g., out-of-date electoral register
Error can also arise from response problemsOutright refusal – “non-response”Refusal to answer certain questions – “item non-response”Bias or deception in answers
91
SO5041: Quantitative Research MethodsUnit 6: SamplingError in sampling
Survey design and error
Sampling error is part of the designNon-sampling error is a problem and can invalidate the conclusions we wish tomake – bias the sample away from representativity of the reference populationUndercoverage or non-response introduces the possibility that the sample we getdiffers systematically from the population (by differing from the part we don’t get)For instance, if we are interested in financial wellbeing and our surveyunderrepresents homeless people or people who move very often the people in thesample may be systematically better off than those we fail to traceIf UL students working part-time are too busy to respond to a questionnaire oursample will under-estimate the average hours spent working part-time
92
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
More sampling strategies
We have discussed several types of sampling strategySimple random samplingVolunteer sampling and “street-corner interviewing”Quota sampling
Random sampling is probability sampling; the others are notOther forms of probability sampling also exist:
Systematic samplingStratified samplingCluster samplingMultistage sampling
93
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Systematic sampling
Systematic sampling is used where the sampling frame has a more-or-less randomorderPick a name at random near the start of the sampling frameSkip k cases, pick the next name, and continue until finishedk , the size of the skip, is calculated as N/n, so that you arrive at the full size ofthe desired sampleThis works as a simple random sample, on the assumption that there is norelationship between where you are in the list and the relevant characteristics youhaveIf there is any order to the sampling frame (like people being grouped together, forinstance, or periodicity) this will not yield a random sample
94
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Stratified sampling
Stratified sampling works by dividing the population into strata and collectingrandom samples within the strataIf the groups or strata are defined according to variables important to the study(e.g., age, gender, occupational group) this method yields statistically moreefficient samples (i.e., sampling error is less than with a simple random sample)Where we know the population distribution of age or gender (e.g., from a census)our sample will automatically have the right proportionsThis method also allows over-sampling of small groups: ethnic minorities, peopleon FÁS schemes, the unemployed – this allows comparisons small groups andothers that would not be possible in a simple random sample
95
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Cluster sampling
Cluster sampling divides the population into groups called clustersThese are often geographic areas, or organisationally basedA random sample of clusters is drawnWithin each cluster, simple random samples of individuals are drawn
96
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Cluster sampling – pros and cons
Advantages:Cost – much less interviewer travel timeCan draw better random samples within cluster than from national sampling frame(e.g., identify all households in cluster first, then sample: better than out-of-date listof addresses)
Disadvantage: statistically less efficient – larger sample needed for same accuracy
97
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Multistage sampling
Multistage samples use several of these methods in combinationFor instance, using electoral wards as clusters, take a simple random sample ofclusters; within each ward, treat streets as clusters and sample again; within eachstreet identify every separate address and do a systematic sample of every nthhousehold
98
SO5041: Quantitative Research MethodsUnit 6: SamplingMore complex sampling strategies
Reading
A&F Chapter 2Bryman Ch 8 (see also Ch 7)
99
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Unit 7: Distributions
100
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Distributions
We have seen how to display and summarise the distribution of variables:Categorical: frequency distribution, percentage distribution, bar and pie chartsQuantitative (interval/ratio): mean, median, IQR, standard deviation, histogram,box-plot
100
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Distributions have shapes
The shape of the histogram tells us about the distribution of the variableIf a variable is “uniformly” distributed we see a flat distribution between theextremes:
05
.0e
−0
4.0
01
De
nsity
0 200 400 600 800 1000flat
101
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Heaped distributions
More often we see “heaped distributions” where more of the observations clusteraround the centre, like this example (age) from the 1999 school-leavers’ survey:
0.1
.2.3
De
nsity
15 20 25 30 35 40age
102
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Many patterns
There are many patterns we might see:UniformExtremesBimodalUni-modal
103
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Polarisation
http://www.fivethirtyeight.com
104
SO5041: Quantitative Research MethodsUnit 7: DistributionsCharacteristics of distributions
Uni-modal differences
Symmetric (with different levels of kurtosis)platykurtic – flattermesokurtic – averageleptokurtic – very concentrated around centre
AsymmetricPositively skewed (to right)Negatively skewed (to left)
105
SO5041: Quantitative Research MethodsUnit 7: DistributionsThe Normal distribution
Mathematically defined distributions
There are some patterns that are defined “theoretically” or mathematicallyAmong these the most important is the Normal Distribution:
Mean
St. Deviation
This is the famous “bell-shaped curve”
106
SO5041: Quantitative Research MethodsUnit 7: DistributionsThe Normal distribution
The Normal Distribution
The normal distribution issymmetric (no skew)mesokurtic (between flatter and pointier)
The mean, mode and median are the sameThe farther you go from the mean, the lower the proportion of cases, in eachdirection symmetrically
107
SO5041: Quantitative Research MethodsUnit 7: DistributionsThe Normal distribution
Histograms display distributions
We can see what a histogram of a variable drawn from a normally distributedpopulation looks like:
0.1
.2.3
.4.5
Density
−2 −1 0 1 2normal %
0.1
.2.3
.4D
ensity
−4 −2 0 2 4normal %
0.1
.2.3
.4D
ensity
−4 −2 0 2 4normal
100 cases 1,000 cases 10,000 cases
As the sample gets bigger, the histogram approximates the theoretical distributionbetter
108
SO5041: Quantitative Research MethodsUnit 7: DistributionsThe Normal distribution
Normal: mean and σ
What makes the normal distribution useful is that its form is well understood:It is completely characterised by its mean and its standard deviation
Mean
Same mean, different SD Same SD, different mean
Online app: http://teaching.sociology.ul.ie:3838/apps/normsd
109
SO5041: Quantitative Research MethodsUnit 7: DistributionsThe Normal distribution
Normal distribution: well-understood
90% of distributionbetween 1.645and -1.645
95% of distribution
and -1.960
97.5% of distributionbetween 2.241
between 1.960
and -2.241
0 1.962.24
1.6450-1.96
-2.24
-1.645-1 1
68% of distributionbetween 1.0and -1.0
About 68% of the cases in a normal distribution are within 1 SD on either side of its mean
95% are within ± 1.96 standard deviations of the mean
97.5% are within ± 2.24 standard deviations of the mean
110
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Unit 8: More on Distributions
111
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Histograms and distributions
Histograms display distributionsProbability distributions describe them formallyThe set of ticket numbers in a raffle has a uniform distribution
flat histogramequal numbers of tickets in all ranges, e.g., 1–100, 101–200, 201–300Thus winning ticket is equally likely to fall in any range: number is not related tochance of selection
111
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Heaped histograms
However, if we were to pick a school-leaver at random from the school-leaver’ssurvey, there is a relationship with date of birth:
dates nearer June 1974 more likelydates much later or much earlier much less likely
. . . a clustered distribution
112
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Shape and standard deviation
The extent of clustering depends on shape, standard deviationSmaller standard deviation means individual chosen at random more likely to fallnear the meanThe Normal Distribution a kind of “ideal type” of clustered symmetric distribution
113
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Examples of normally distributed variables
IQ scores, GRE scores – designed that wayTime for a fast-food delivery – possibly normal, e.g., mean of 30 mins, standarddeviation of 5 minsAlcohol consumption of non-abstainers, e.g., mean of 20 units per week, standarddeviation of 7 unitsHow far darts hit relative to their target
114
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsCharacteristics of distributions
Mean and St Deviation are enough
Combining the mean, standard deviation and the “knowledge” that the distributionis normal means we can judge
the proportions above, below or between any given valuesthe chance of a case chosen at random of falling in any given range
115
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Standard Normal Distribution
The Standard Normal Distribution is a special case of the normal distribution:Mean = 0Standard deviation = 1
We can map any given ND onto it bysubtracting the meandividing by the standard deviation
We can thus use the SND to estimate probabilities/proportions for any normaldistribution, once we know the mean and standard deviation
116
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Standard Normal Distribution TableTable of the Standard Normal Distribution
Right tail (probability of X > z).00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.00 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
.10 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
.20 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
.30 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
.40 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
.50 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
.60 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
.70 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
.80 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
.90 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .16111.00 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .13791.10 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .11701.20 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .09851.30 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .08231.40 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .06811.50 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .05591.60 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .04551.70 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .03671.80 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .02941.90 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .02332.00 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .01832.10 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .01432.20 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .01102.30 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .00842.40 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .00642.50 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .00482.60 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .00362.70 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .00262.80 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .00192.90 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .00143.00 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .00103.10 .0010 .0009 .0009 .0009 .0008 .0008 .0008 .0008 .0007 .00073.20 .0007 .0007 .0006 .0006 .0006 .0006 .0006 .0005 .0005 .00053.30 .0005 .0005 .0005 .0004 .0004 .0004 .0004 .0004 .0004 .00033.40 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .00023.50 .0002 .0002 .0002 .0002 .0002 .0002 .0002 .0002 .0002 .00023.60 .0002 .0002 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .00013.70 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .00013.80 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .00013.90 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .00004.00 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000
117
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Reading the table
The table is best suited to telling us theproportion of the distribution, or equivalently,the chance of picking a case at random above a given value above the mean
This is because it starts at zero (the mean of the SND) and goes up onlyHowever, since the normal distribution is symmetrical we can use this table forvalues below the mean too
118
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Proportion above or below a given level above the mean
For example, given mean 100 and standard deviation 20, what’s the chance ofobserving a value above 130?
µ = 100, σ = 20,X = 130
⇒ z =X − µσ
=130− 100
20= 1.5
From the table, we see that z=1.5 corresponds to p=0.0668 or about 6.7%: 6.7%of the distribution is above 130Clearly, 100% - 6.7% = 93.3% of the distribution is below 130
119
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Proportion below a given level below the mean
We use the symmetry to make calculations below the mean
For example, for the same distribution, what’s the chance of observing a valuebelow 85?
µ = 100, σ = 20,X = 85
⇒ z =X − µσ
=85− 100
20= −0.750
By symmetry, the chance of being below -0.75 is exactly the same as that of beingabove +0.75, so we read that:z = 0.75 ⇒ p = 0.2266, which means 22.7% of the distribution is below 85 (and77.3% = 100% - 22.7% is above 85)
120
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Proportion between two values
Once we know how to calculate the proportion of the distribution above or belowany value, we can calculate the proportion between any pair of values
X1 X2
121
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Proportion between two values
To calculate the proportion between X1 and X2, calculateThe proportion below X1: P(X < X1)The proportion above X2: P(X > X2)P(X1 < X < X2) = 1− P(X < X1)− P(X > X2)
122
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Working backwards: given p find z
We may also wish to work in the opposite directionInstead of asking what proportion of the distribution is above X, we may ask whatis the level such that proportion p of the distribution is above it?For example, given the same distribution, what is the level such that only 5% ofthe distribution is above it?
123
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsThe Standard Normal Distribution
Working backwards: given p find z
Given the same distribution, what is the level such that only 5% of the distributionis above it?We work backwards, starting in the body of the table by searching for the valuenearest to 5% or 0.050This corresponds with z = 1.645 (falls between 1.64 and 1.65)Reverse the formula: X = σ × z + µ = 20× 1.645 + 100 = 132.9Therefore, 5% of this distribution is above 132.9
124
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsOnline apps
App: P<X
Find the proportion below X, (above or below the mean)
Link: Proportion below X
125
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsOnline apps
App: P>X
Find the proportion above X, (above or below the mean)
Link: Proportion above X
126
SO5041: Quantitative Research MethodsUnit 8: More on DistributionsOnline apps
App: P between X1 and X2
X1 and X2 may both be on either side of, or straddle the mean.
Link: Proportion between X1 and X2
127
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremOutline
/Unit 9: Sampling Distributions and the Central Limit Theorem /
128
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremOutline
Outline
In this lecture we will examinePopulation parameters and sample estimatesThe Central Limit Theorem: this is why the normal distribution is importantHow to deal with the imprecision of sample estimates: confidence intervalsReading: Agresti and Finlay, Chapter 4
128
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Sampling error and sampling distributions
How do we reason about sampling and sampling error?Sampling Error = true value – sample value; but true value unknown!We can reason about the likely distribution of sampling errorIf the true value is µ, how far is a sample value likely to fall from it?−→ a distribution of possible sample values
129
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Referendum sample
Example: referendum with yes/no answerThe probability a randomly selected voter goes yes vs no is unknownThat is equivalent to saying the yes/no proportion is unknownWhat if we want to know overall proportion, and poll 1500 random voters?If we get a result of 54/46: how do we know we have any accuracy?
130
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Computer simulation
Agresti and Finlay do a computer simulation to testComputer selects 1500 random numbers: 0− 0.5 ⇒ yes 0.5− 1 ⇒ noFirst run gets 51.6% yes, next 49.1%They run it a million times, and make a histogram of the results: vast majoritybetween 0.46 and 0.54, ie +- 4%
131
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulation by hand: N = 4
We can alternatively work through a “toy” exampleYes and No equally likelySample size of 416 different possible combinations, each equally likely
Result is a “binomial” distributionNote number of possible combinations is 2N so this calculation is practical onlywith tiny samples
132
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N
Y
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N N
N Y
Y N
Y Y
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N N N
N N Y
N Y N
N Y Y
Y N N
Y N Y
Y Y N
Y Y Y
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N N N NN N N YN N Y NN N Y YN Y N NN Y N YN Y Y NN Y Y YY N N NY N N YY N Y NY N Y YY Y N NY Y N YY Y Y NY Y Y Y
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N N N N 0 : 4N N N Y 1 : 3N N Y N 1 : 3N N Y Y 2 : 2N Y N N 1 : 3N Y N Y 2 : 2N Y Y N 2 : 2N Y Y Y 3 : 1Y N N N 1 : 3Y N N Y 2 : 2Y N Y N 2 : 2Y N Y Y 3 : 1Y Y N N 2 : 2Y Y N Y 3 : 1Y Y Y N 3 : 1Y Y Y Y 4 : 0
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Simulating with N=4
N N N N 0 : 4N N N Y 1 : 3N N Y N 1 : 3N Y N N 1 : 3Y N N N 1 : 3N N Y Y 2 : 2N Y N Y 2 : 2N Y Y N 2 : 2Y N N Y 2 : 2Y N Y N 2 : 2Y Y N N 2 : 2N Y Y Y 3 : 1Y N Y Y 3 : 1Y Y N Y 3 : 1Y Y Y N 3 : 1Y Y Y Y 4 : 0
133
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
N = 4
05
10152025303540
0 25 50 75 100
Per
cent
age
of s
ampl
es
Predicted Yes Vote
Sampling distribution for N=4 134
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Online Apps:
Heads and TailsSimulating binomial sampling
135
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Central Limit Theorem
The Central Limit Theorem: for a sufficiently large sample size, the samplingdistribution of a statistic such as the sample mean will be approximately normalThis is true, whatever the distribution of the population of interestWe have seen this with the simulation of sampling vote where the true populationproportions are 50:50
136
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Binomial distribution
Population Distribution
0
10
20
30
40
50
60
Yes No
Per
cent
age
Vot
e
Sampling Distribution
PCT
55.00
53.00
51.00
49.00
47.00
45.00
4000
3000
2000
1000
0
Std. Dev = 1.29
Mean = 50.00
N = 10000.00
137
SO5041: Quantitative Research MethodsUnit 9: Sampling Distributions and the Central Limit TheoremReasoning about sampling error
Sampling distribution
This means that for sufficiently large samples, our sample statistic can be regardedas being drawn from a random distribution with mean µ and standard deviationσX = σ√
n
This is a very important theorem because it allows us to reason about the precisionof sample estimatesNote: σX , the standard deviation of the sample mean, is known as the standarderror
138
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Unit 10: Sampling and confidence intervals
139
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Point estimates
Agresti/Finlay: A point estimator of a parameter is a sample statistic that predictsthe value of that parameterFor instance, our sample mean X is a point estimator of the population mean µGood point estimators require two things:
To be centred around the true value (unbiased)To have as small a sampling error as possible (efficient)
139
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Unbiased, efficient
Unbiasedness means that the estimate will fall around the true value, “on average”Efficient means that it will fall close to the true valueEstimators such as the sample mean, standard deviation, or sample proportion arereasonably efficient and unbiased
Sample mean:
X =
∑Xi
nSample standard deviation:
σ = s =
√∑(Xi − X )2
n − 1
Sample proportion:π1 =
n1
n1 + n2
140
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Sampling distributions and sampling error
We have seen that sample estimates have sampling error, and now understandsomething of its characteristics, by exploring sampling distributionsSample estimates can be considered as being drawn from an imaginary randomdistribution, which (if the estimator is unbiased) will centre on the true populationparameter, with a level of imprecision measured by the standard error (which willbe as low as possible if the estimator is efficient)Where the sample is sufficiently large, the Central Limit Theorem tells us that thesampling distribution is normal, with mean µ and standard deviation σXWe can use this information to add to our point estimate a measure of itsprecision: the Confidence Interval
141
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Confidence intervals
Agresti/Finlay: “A confidence interval for a parameter is a range of numbers withinwhich the parameter is believed to fall”“The probability that the confidence interval contains the parameter is called theconfidence coefficient” which is a number close to 1, like 95% or 99%A confidence interval is a band around our point estimate within which we canclaim that there is a, for instance, 95% chance that the true population value lies –how do we calculate this?We work from the sampling distribution – if we are estimating a mean value, weknow that 95% of estimates will fall within ± 1.96 standard errors of the mean
142
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Example: transport spending
Assume we have a national sample, and are estimating spending on transport – ifthe true value is €35 per month, with a standard deviation of €5, and the samplesize is 1600, then 95% of all possible sample estimates will fall between
35− 1.96× 5√1600
and35 + 1.96× 5√
1600which is
µ± 1.96× σ√n
The 1.96 comes from the normal distribution: 95% of the distribution is between1.96 standard deviations above and below the mean
143
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Reverse the reasoning
We don’t know µ, but we can reverse the reasoning and say that the true valuehas a 95% chance of falling in the range X ± 1.96× σXWe don’t know $ σX $, the standard error either:
σX =σ√n
However, we do know the standard deviation of the sample, s, and this is anunbiased estimator of the population standard deviation, σ, so we can estimate thestandard error:
σX =s√n
144
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Calculating the interval
Thus in our example, if our sample mean is €34.75 and our sample standarddeviation is €7.7 we can calculate:
The standard error is 7.7√1600
= 7.740 = 0.1925
The lower bound is 34.75− 1.96× 0.1925 = 34.37The upper bound is 34.75 + 1.96× 0.1925 = 35.13
Thus our point estimate is 34.75 and our 95% confidence interval runs from 34.37to 35.13We can interpret this as saying there is a 95% chance that the true value is in thisrangeThis is because with 95% of all possible samples, such a confidence interval willinclude the true value
145
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Levels of confidence
The 95% confidence level is often used but sometimes the chance of being wrongone time in twenty is not acceptable
We can use the normal distribution to choose other levels of confidence: forinstance, 99% of the normal distribution lies within ±2.58 standard deviationsAgain in our example (sample mean is 34.75%, standard deviation 7.7)
The lower bound is 34.75− 2.58× 0.1925 = 34.25The upper bound is 34.75 + 2.58× 0.1925 = 35.25
To be more sure, we are necessarily less precise
146
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
Standard deviation/error for proportions
The example above is for sample means: the CI for sample proportions is similarThe sampling distribution for a proportion (percent unemployed, percent votingRepublican, percent satisfied etc.) is also normal for large samplesThe standard deviation of a 0/1 or yes/no variable depends on the proportions inthe population however:
σπ =√π × (1− π)
We can estimate the standard error then as
σπ =
√π(1− π)
n
147
SO5041: Quantitative Research MethodsUnit 10: Sampling and confidence intervalsEstimating imprecision: confidence intervals
CI for proportions
The 95% confidence interval for a proportion uses this, the same was as for themeanIf we have a sample 1000 and find 45% of voters expressing an intention to voteyes we calculate as follows:
σπ =
√0.45(1− 0.45)
1000= 0.0157
and thus the CI is0.45± 1.96× 0.0157
That is π ± 0.031, i.e., plus or minus 3%
148
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
Unit 11: Two new distributions: t and χ2
149
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
The χ2 test for independence in tables
The χ2 test for association in tablesIndependence: no association between two variables
pattern of row percentages the same in all rowspattern of column percentages the same in all columns
Even if independence holds in the population, sampling variability leads todifferences in percentagesHow big can the differences be before we can be convinced that there is reallyassociation in the population?
149
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
Compare “observed” with “expected”
Method: compare the real table (“observed”) with hypothetical table underindependence (“expected”)Summarise the difference into a single figure (χ2 statistic, chi-sq)Compare χ2 statistic with known distribution. . .What is the probability of getting a sample statistic “at least this big” by simplesampling variability if independence holds in the population?
150
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
“Expected” ⇒“independence”
The “expected” table has the same row and column totals, but the cell values aresuch that the percentages are the same as in the total row and column:
nij =RiCj
T
For each cell we summarise the difference between observed (O) and expected (E )values as
(O − E )2
E
The summary for the table as a whole is the sum of this quantity across all cells:
χ2 =∑ (O − E )2
E
151
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
X 2 has a χ2 distribution
This statistic has a known distribution, the χ2 distributionThat is, if we take a large number of samples from a population where there is noassociation, and calculate the statistic, they will have a distribution in a knownform, and we can calculate the probability of finding a value “at least as large as”any given numberDepends only on the “degrees of freedom”: number of rows minus one times thenumber of columns minus one:
df = (r − 1)(c − 1)
152
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
Reading the χ2 distribution table
To read the table, we go to the row corresponding to the degrees of freedom, andread across until we get to the column with our chosen probability level (say 0.05)If our χ2 is bigger than the table value, then there is at most one chance in 20 (i.e.,0.05) that it has arisen by sampling variability, if the population has no association
153
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
χ2 distribution
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0 2 4 6 8 10
df 1
df 2
df 3
df 4
df 5
Figure: The χ2 Distribution for various degrees of freedom154
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The χ2 distribution
Table of the χ2 distribution (chi-sq)
Values of the χ2 statistic for various degrees of freedom and areas under the right tail.(see full table)
Degrees of Area under right tailFreedom 0.100 0.050 0.025 0.010 0.0051 2.706 3.841 5.024 6.635 7.8792 4.605 5.991 7.378 9.210 10.5973 6.251 7.815 9.348 11.345 12.8384 7.779 9.488 11.143 13.277 14.8605 9.236 11.070 12.833 15.086 16.7506 10.645 12.592 14.449 16.812 18.5487 12.017 14.067 16.013 18.475 20.2788 13.362 15.507 17.535 20.090 21.9559 14.684 16.919 19.023 21.666 23.58910 15.987 18.307 20.483 23.209 25.188
155
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
The t-distribution
We have used the Standard Normal Distribution to make confidence intervalsaround sample means and sample proportionsThe depends on the Central Limit Theorem:
A sufficiently large sample drawn from any population will have a normaldistribution, even when the population distribution is not normal
In practice, for small samples the approximation to the normal distribution is notgood, so we use a related distribution called the t-distribution
156
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
Student’s t
The t-distribution is like the normal distribution but wider and flatterThe bigger the sample the closer it is to the normal distributionIt makes the CI a little wider for smaller sample sizes
157
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
Student’s identity
- It is known as “Student’s t”, after theGuinness statistician who invented it -William Sealey Gosset (1876–1937) -Guinness were wary of losing trade secretsso publication was banned
158
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
Degrees of freedom
In practice, we first determine the “degrees of freedom”: this is the sample sizeminus 1Then we choose our probability level – a 95% CI has a probability level of 5% or0.05Then we read off the t-value: this is used in constructing CIs exactly as the z-valuefrom the Standard Normal Distribution
159
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
Confidence interval with t
If the sample is large, Central Limit Theroem says sampling distribution is normal:use X ± z × SE for CIIf the sample is small, sampling distribution is wider than normal: use X ± t × SE
160
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
Table of the t-distribution (part)
Table of the Student’s t DistributionTwo-tailed probability
Degrees of Probability level (Area under both tails)Freedom 0.10 0.05 0.025 0.01 0.0051 6.314 12.706 25.452 63.657 127.3212 2.920 4.303 6.205 9.925 14.0893 2.353 3.182 4.177 5.841 7.4534 2.132 2.776 3.495 4.604 5.5985 2.015 2.571 3.163 4.032 4.7736 1.943 2.447 2.969 3.707 4.3177 1.895 2.365 2.841 3.499 4.0298 1.860 2.306 2.752 3.355 3.8339 1.833 2.262 2.685 3.250 3.69010 1.812 2.228 2.634 3.169 3.581. . .
(see full table) 161
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 11: Two new distributions: t and χ2
The t-distribution
A worked example
Example: X = (9, 2, 5, 5, 13, 13, 5, 5, 1, 4)
X = 6.2
s = 4.158
SE = 1.315
95% confidence, 9 degrees of freedom: t0.05,9 = 2.262
Low: 6.2− 2.262× 1.315 = 3.226
High: 6.2 + 2.262× 1.315 = 9.174
162
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignSurveys and Questionnaires
Unit 12: Questionnaire Design
163
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignSurveys and Questionnaires
Surveys and Questionnaires
The main way of collecting survey data is by means of questionnaires andstructured interviewsThe questionnaire used in structured interviewing is sometimes referred to as theinterview scheduleQuestionnaires filled out by the respondent him/herself are referred to asself-completion questionnairesThere are several important issues relating to the design and implementation ofquestionnaire-based surveysA good reference is DA de Vaus, Surveys in Social Research, chapter 6(“Constructing Questionnaires”)
163
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignSurveys and Questionnaires
Purpose of questionnaire
The practical purpose of a questionnaire survey is to get standard info from a largenumber of people relatively cheaply and reliablyThis requires a minimum of ambiguity in the meaning of the recorded answers –need to be clear and preciseIt also requires a minimum on non-response
Item non-response: refusal to answer or “don’t know” to specific questionsRespondent refusal: flat refusal to participate
Hence good design is important:Clarity of questions and structureNo unnecessary burden on the respondent
164
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignMeasurement Theory
Measurement Theory
A questionnaire is a data collection instrumentThis implies that what is measured must be standardised: the same questionsasked of every respondent, in the same way, in order that the data recorded meansthe same thing – standardisation is what allows the use of computers (andstatistics) to analyse itThis is related to Reliability: the instrument will get the same results in the samecircumstances
165
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignMeasurement Theory
Validity
Validity is also important: an instrument (e.g., a set of questions) is valid insofaras it measures what it is intended to measureAn instrument can be valid without being reliable, and vice versaFor instance, a set of questions designed to assess right-wing attitudes may bereliable (provide consistent results) but not valid (if scoring high is not verystrongly associated with being right-wing, etc.)
166
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignMeasurement Theory
Reliability, comparison, validity
Reliability is very important for comparison: without it we cannot compare cases orattribute differences between cases to their other characteristics – reliability isessential for analysisWithout validity, there is no point to analysis – you’re not studying what you thinkyou’re studying
167
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Questionnaire Design
There are two aspects to questionnaire designThe overall structure of the documentThe individual questions
168
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Composing questions
The questions themselves must be clear and unambiguousExpress the questions in clear and simple languageDon’t use leading questions (Avoid “Isn’t it the case that X is a good idea?”; prefer“Do you think X is a good idea or a bad idea?”)Ask a single thing at a time (Avoid “Do you have a job, and if so how much do youearn?”)Avoid hypothetical questions: often don’t give useful information (e.g., “If you had agrant how much more money would you spend on drink?”)If you need to ask for information about others, restrict it to factual information(“What is your husband’s job?”, but not “What does your husband think about X?”)Make the questions easy to answer:
provide a set of options (perhaps on a prompt card)with amounts (time, money etc.) provide options help to precision (e.g., times perweek, ranges)
169
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Document structure
The document must be structured simply and logicallyGroup questions in a way that will seem reasonable to the respondentUse clear routing:
5: Do you have a job?(if yes, ask qn 6, else skip to qn 7)6: What is your job?7: . . .
Reserve sensitive questions (e.g., income, drug use) to the end of thequestionnaire: less likely to scare off respondents there
170
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Question structure
The structure of questions is also importantEach question must have its answer space: a tick-box, a space for writing, a set ofcategoriesClosed questions are preferred: a fixed set of options makes it easier to answer andto analyseAlways allow the category “Other” with closed questions, with space to write (“Ifother, please specify:”)
171
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Types of closed question
Closed responses can take several forms:One only: “Tick the category that best describes your job”One or more: “Tick all of the following reasons that are relevant”Ranking: “Rank the importance for you of each of the following reasons (1, 2, 3etc.)”
172
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Likert scales
Measurement of attitudes often uses “Likert” scales:a set of statements relevant to the attitude being measuredwith options indicating agreement:
Strongly disagreeDisagreeNeither agree nor disagreeAgreeStrongly agree
173
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Practical organisation
The practical organisation of the questionnaire must take question structure intoaccount: easy to answer and easy to processProcessing involves several steps
Recording the information as it is being collectedChecking it is consistent and the right questions are answeredCoding it onto a computerChecking it is consistent once on computerAdding variable and value labels to help the computer data set to make sense
174
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Design principles
The design of the questionnaire should keep the first three of these in mindIt should be easy to record the informationIt should be easy to read the recorded information to see that it makes sense, to seethat the correct routing has been followedThe layout of the questionnaire should anticipate the structure of the computer dataset
175
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Example questionnaire
Example Questionnaire Extract Office use only1: Sex: Male 1 Female 2 q1
2: Age: 18 or under 1
19 – 23 2
24 – 30 3
31 – 40 4
41 – 50 5
51 – 64 6
65 or more 7
q2
176
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Many options, tick one
3: Which of the following options best describes your current situation? (show card):Self-employed 1
Employed 2
Unemployed 3
Retired 4
Family care 6
Full time student, school 7
Long-term sick, disabled 8
Training scheme 9
Other 10
q3
If “Other” please specify:
177
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Ordinal options (with a flaw)
4: How often do you read newspapers? Tick the category that is closest:Never 1 (Skip to question 6)More than 1 per day 2
1 every day 3
2–4 per week 4
1 per week 5
Less than 1 per week 6
q4
178
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Picking zero or more from a list
5: Which of the following newspapers do you read? (show card) Tick each one thatapplies:Irish Times a q5a
Irish Independent b q5b
Irish Examiner c q5c
Limerick Leader d q5d
Sunday Independent e q5e
etc. etc. f q5f
Other g q5g
If “Other” please specify:
179
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Attitude questions with Likert answers
6: People have different views about women’s role in society. Please indicate whetheryou agree or disagree with each of the following things people might say, using thecategories provided:
Stronglyagree
Agree Neitheragree nordisagree
Disagree Stronglydisagree
A woman’s place is in the home 1 2 3 4 5 q6a
Young children suffer if the motherworks outside the home
1 2 3 4 5 q6b
Women are entitled to the samepay for the same work as men
1 2 3 4 5 q6c
When a wife works it is importantfor the husband to do his share ofthe housework
1 2 3 4 5 q6d
180
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Pilot your questionnaire
It is very important that a questionnaire work well once in the fieldTest it beforehand,
on friends/colleagueson a “pilot” subsample drawn from the reference population
Are the questions comprehensible?Are the categories in closed questions adequate? Do they cover everything? Makethe right distinctions? No very unlikely ones? Will they make sense to therespondent?
181
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignQuestionnaire Design
Piloting
Is the structure okay? Not too complicated or illogical?Is there anything likely to confuse the respondent or interviewer?Exactly how long will it take? If too long, shorten it now!Piloting will also allow you to create closed categories for questions, usingreal-world feedback rather than your imagination
182
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignDistribution and collection
Distribution and collection
How to choose a sample? Depends on research question and resourcesTypes of sample
Random sample from explicit sample frameMulti-stage random sampleQuota sampleAccidental/convenience sample
183
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignDistribution and collection
Samples
The first two are the most powerful – true random sample – and it is enormouslyimportant to get as many of the sample to respond – each individual countsWhen it is not possible to generate a sampling frame, quota or conveniencesamples may be worthwhileWith quota sampling, representativity is “enforced” by quota, and the emphasis ison filling the quotas as efficiently as possible
184
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignDistribution and collection
Convenience and snowball
With convenience sampling (e.g., snowball sampling), representativity isn’t possiblein the same way, so more emphasis on getting enough people, though still veryimportant to minimise refusals (very likely, people who refuse are different in someinteresting way)With quota and convenience sampling, also very important to make surerespondents fit the requirements – quite formal for quota, related to the researchproblem definition for convenience sampling
185
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignDistribution and collection
Accessing respondents
With a sample from a frame there is a clear identification of exactly whom tointerview, often with addresses – the work is then in arranging times suitable tothe respondent (and in a large survey, dividing respondents among interviewers)With quota or convenience sampling, some other strategy has to be found tocontact potential respondents, e.g., random telephone dialling or stopping peoplein the street for quota sampling; snowball sampling (asking respondents to tell youabout other people you could interview), working through groups or organisationsor locations etc. for convenience sampling
186
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Modes of delivery
How a questionnaire is delivered has consequences for how it is answered and whoanswers itWe can distinguish between face-to-face interviews and self-completionquestionnaire
187
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Face to face
Face to face (perhaps not literally) involve an interviewer mediating the questionsand the responses – e.g., s/he reads the questions, provides prompts andclarifications, and writes down the answers
Can be carried out by the researcheror a delegated interviewerCan be carried out on the phone
188
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Interviewer delivered
Questionnaires applied by an interviewer have certain advantages:Interviewer can clarify questionsCan also clarify ambiguous answersMore likely to get a complete responseMore likely to get a response in the first place (harder to refuse a person, easy toignore a piece of paper)
And disadvantages:Very costlyCertain sorts of information may be suppressed in a face-to-face interviewer that maybe written downAnswers possibly more likely to be slanted towards an expectation of what theinterviewer would like to hear
189
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Self completion
Self-completion questionnaires are completed by the respondent, perhaps undersupervision or perhaps in his/her own timeThey can be distributed to a sample (e.g., pupils in schools, workers inorganisations, by post to addresses etc.)They can also be distributed in an “opportunistic” way: handed out at an event(e.g., after mass, at a concert) or left in a public place (perhaps with somepublicity or an endorsement by someone “in authority” – e.g., the priest brings thecongregation’s attention to the fact that they are there)
190
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Response rate
Response rate is a big problem with self completionWith distribution to specific people (i.e., sample, though non-random) make surethere is a covering letter to the individual indicating that the research is valuable andthe data will be confidentialFollow up with reminders some time later to those who haven’t repliedWith “opportunistic” distribution, make sure the questionnaire makes the researchseem valuable, and guarantee confidentialityWith opportunistic distribution, you can’t issue personal reminders (ask the priest toremind the congregation a couple of weeks later, etc.)
191
SO5041: Quantitative Research MethodsUnit 12: Questionnaire DesignModes of delivery
Advantages and disadvantages
The advantages of self-completion aremore privacy for the respondentcheap and easy
The disadvantages are manyMore possibilities for misunderstandingNo control over meaningless, ambiguous or illegible answersMuch harder to control response rate: will be lowMore room for response bias to enter: only those with an axe to grind will botherreplying (in the extreme case) – for instance, people who have experienced sexualharassment in the workplace are far more likely to respond to a relevant survey thanthose who have not, so don’t use such a technique to estimate the prevalence ofsexual harassment
192
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testing
Unit 13: Hypothesis testing
193
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testing
Topics
Three important topics today:Hypothesis testingSignificancet-test for paired samples
193
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Confidence intervals assess imprecision
Confidence intervals allow us to present sample information appropriatelyPoint estimate, e.g., mean or other sample statistic: our “best guess” of the truevalueConfidence interval: range in which we are “confident” true value (the populationparameter) lies
In combination, the point estimate and CI give us the answer with a measure of itsprecision
194
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Hypothesis testing and CIs
Statistical inference proceeds by hypothesis testing: a formalisation of theConfidence Interval approachEasier to disprove something than to prove something trueThus if we wish to test if a variable has an effect on another (e.g., does a switchto a 4-day week change productivity) we set up a hypothesis, for instance
H1: Productivity after is not the same as productivity before, Xa 6= Xb
195
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Recent example
Microsoft recently experimented with a 4-day week in Tokyo:
https://www.theguardian.com/technology/2019/nov/04/microsoft-japan-four-day-work-week-productivity
They found productivity increased, as well as worker satisfaction
196
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Null hypothesis
To test this turn it around, and set up a null hypothesis that says the opposite:
H0: Productivity after is equal to productivity before, Xa = Xb
If we can reject the null hypothesis on the basis of our sample data, we can say thedata support the main hypothesis
197
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Claims about population
Hypothesis testing is a way of using the reasoning behind CIs to make specificclaims about the populationSay we want to know if there is a relationship between two variables, e.g., whetherswitching to a 4-day week has an effect on productivity (positive or negative)We look at a sample of workers to make inferences about the populationWe begin with a hypothesis: Xa 6= Xb or Xa > Xb
We negate the hypothesis to form a null hypothesis, called H0: H0 : Xa = Xb
which is equivalent to H0 : Dx = Xa − Xb = 0That is, on average, the productivity difference is zero
198
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Testing the null hypothesis
We then test the null hypothesis:First we calculate a sample mean productivity difference, Dx
Then we construct a confidence interval at our chosen level of confidence (e.g.,95%): Dx ± z0.95 × SEThe CI gives us a band around the point estimate within which we are 95% sure thetrue value liesIf zero lies outside the interval, we are at least 95% sure the true population value isnot zero, and we can reject the null hypothesisIf zero lies within the interval, then zero is in the range of plausible values, so wecannot reject the null hypothesisWe don’t say we "accept the null hypothesis" because zero is just in the range ofplausible values, and other values in this range are approximately as likely
199
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Reject or fail to reject
Rejecting the null hypothesis constitutes support for the initial or “alternative”hypothesisFailing to reject the null hypothesis means the data fail to support the initialhypothesis: “there is no evidence that the switch to a 4-day week affectsproductivity”Failure to support the initial hypothesis may be because
It is actually false, i.e., Dx = 0The effect is small and/or very variable, and thus the sample is too small to detect it
200
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Example: Productivity before and after an intervention
ID Before After1 3.5 4.62 1.8 2.33 2.4 2.54 3.3 4.45 1.7 2.16 3.7 5.17 4.4 5.68 3.4 4.19 1.8 1.4
10 2.3 1.7
201
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
How do we “test” the null hypothesis?
In this example we calculate the difference in productivity before and after, in thesample dataSome may be negative, some may be positive, but we are interested in theaverage: is it systematically different from zero?Strategy: calculate the mean of the differences, and construct a CI around it (sayat 95%)If zero lies outside the CI, then we are at least 95% sure the true value lies in arange that does not include zeroIf zero within the CI, then the range within which we think the true value lies doesinclude zero
202
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingHypothesis Testing
Reject or not?
In the former case (zero outside interval), we can reject the null hypothesis: we areat least 95% sure that zero is not the true valueWe can therefore say the data support the initial hypothesis that the switch to a4-day week affects productivityIn the latter case (interval contains zero), we cannot reject the null hypothesis:zero is in the range where we feel the true value liesIn this case there is no evidence in the sample data of an effect of the 4-day weekon productivityThis is not the same as evidence of no effect!That zero lies within the CI is not the same as zero being the true value!
203
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
Statistical significance
Significance: let’s say we do a hypothesis test with a 95% confidence level, and wefind the zero is way outside the CIWe can try again with a 99% confidence level:
If it is still outside the interval we are not “at least 95%” but “at least 99%” surethat zero is not the true value
If we keep trying with CIs with higher confidence levels we will eventually find onewhere zero is just outside the intervalIf that is at confidence level C we can say that we are C% sure (not “at least” anymore) that zero is not the true value
204
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
The chance of being wrong?
There is then a C% chance that the true value is in the range that doesn’t includezero, or a 100% - C% chance that the true value is outside the CI, and thereforecould include zeroThis p = 100%-C% value is the probability that we get a sample statistic asdifferent from zero as we did, even though the true value was zeroThis is known as the significance of the sample estimate, or its p-valueWe want it to be as small as possible, typically under 5% (0.05)p-values are widely used – stats programs report them in many placesIn general the interpretation is “what’s the probability of getting this result bychance if the null hypothesis was true?”
205
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
App: confidence intervals and significance
http://teaching.sociology.ul.ie:3838/apps/cislide
206
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
t-test
Rather than use the CI we can set this up as a “t-test”We can find the t corresponding to the CI just touching zero thus:
t =X
SE
If that t is larger than the critical value, then the CI using the critical value issmaller and doesn’t overlap zeroThe significance is the exact p-value of that tThis example is a “paired sample t-test”
207
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
t-test example in Stata
. gen diff = after-before
. ttest diff == 0One-sample t test
Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
diff 10 .5499999 .2166667 .6851601 .0598659 1.040134
mean = mean(diff) t = 2.5385Ho: mean = 0 degrees of freedom = 9
Ha: mean < 0 Ha: mean != 0 Ha: mean > 0Pr(T < t) = 0.9841 Pr(|T| > |t|) = 0.0318 Pr(T > t) = 0.0159
208
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
t-test – paired
. ttest before==afterPaired t test
Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
before 10 2.83 .299648 .94757 2.152149 3.507851after 10 3.38 .4859812 1.536808 2.280634 4.479366
diff 10 -.5499999 .2166667 .6851601 -1.040134 -.0598659
mean(diff) = mean(before - after) t = -2.5385Ho: mean(diff) = 0 degrees of freedom = 9Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0Pr(T < t) = 0.0159 Pr(|T| > |t|) = 0.0318 Pr(T > t) = 0.9841
209
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
χ2 test and significance
Another example of significance occurs in the χ2 (chi-sq) test for association in atableHere the initial hypothesis is that the two variables are associatedThus the null hypothesis is that they are not associated (the “independencehypothesis”)
When we calculate the χ2 statistic (∑ (O−E)2
E ) we compare its value with therange of possible values we would get if H0 were trueThis is what we read from the table of the χ2 distribution
210
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
χ2 example in Stata
. tab race region, chirace of region of the united states
respondent north eas south eas west Total
white 582 307 375 1,264black 82 94 28 204other 15 14 20 49
Total 679 415 423 1,517Pearson chi2(4) = 53.1683 Pr = 0.000
211
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
Signficance and error
Another way of looking at significance is “the chance we would be wrong if webelieve the initial hypothesis”For instance, if there is one chance in twenty (p = 0.05) that the true value isoutside the CI, then by basing our decision on the CI we will be wrong one time intwentyThis is known as Type I Error: rejecting the null hypothesis when it is true
e.g., the true value might be zero but a small number of possible samples generateCIs that don’t include zeroe.g., there may be no association but a small number of possible samples yield highχ2 statistics
212
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
Type I and Type II error
If very important to avoid Type I error,use high confidence levels (e.g., 99.5%instead of 95%) or insist on lowp-values (e.g., 0.005 instead of 0.05)However, there is a second type oferror, Type II
Failing to reject the null hypothesiswhen it is false
That is, failing to support the initialhypothesis even though it is true
213
SO5041: Quantitative Research MethodsUnit 13: Hypothesis testingStatistical significance
Type I and Type II
If we raise the confidence level we reduce the risk of Type I error but raise the riskof Type II errorThat is, if we make a special effort not to accept an initial hypothesis unless thereis very clear evidence, we necessarily fail to accept it where there is only fairly clearevidenceFor a given p-value, we can only reduce the Type II error by increasing the samplesize
214
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
Unit 14: More $t$-tests
215
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
The basic t-test: sample versus reference-point
The simplest t-test compares a sample mean against a fixed number:
H0 : µ = Xr
t =X − Xr
SE
If t is bigger than the critical value for 95%, or its p-value is below 0.05, we rejectthe null hypothesis
215
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
Paired-sample t-test
Comparing “paired samples” is the same, with the diffence being compared withzero:
D = Xafter − Xbefore
H0 : δ = 0
t =D − 0SE
216
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
“t-test” for a proportion
Comparing a sample proportion against a fixed number such as 50% has a similarlogicIt doesn’t use the t-distribution, but can use standard normal for “large” samples
H0 : π = πr
z =p − πrSE
217
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
Directional hypotheses
A complication: some hypotheses are directionale.g., holidays make you happier, training raises your earning power
H1 : Wafter >Wbefore
Ho : Wafter ≤Wbefore
Ho : D ≤ 0
218
SO5041: Quantitative Research MethodsUnit 14: More $t$-tests
Direction ⇒ one-way t-test
If zero is below the CI, reject the null hypothesisIf zero is within or above the CI, cannot rejectNet result: higher confidence for the same CI – 1.96× SE for 97.5% not 95%
219
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsComparing means across groups
Comparing means across groups
We don’t always have situations where we want to test something as simple aswhether the true answer is zeroHowever, we very often want to test whether a mean (e.g., income) is differentaccording to values of another variable (e.g., sex)If sex affects wage, we would expect to see the mean wage for men (Xm) to bedifferent from the mean wage for women (Xw )We can consider the sample difference (Xm − Xw ) to be a point estimate of thepopulation differenceThe null hypothesis is that Xm = Xw or Xm − Xw = 0
220
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsComparing means across groups
Independent-samples t-test
To construct a CI we need the SE, which is very like the normal one if both groupshave the same population variance (or standard deviation):
√s2
n1+
s2
n2= s
√1n1
+1n2
If this cannot be assumed, the SE is more complex, and depends on the separatestandard deviations
√s21n1
+s22n2
221
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsComparing means across groups
Two-sample t-test in Stata
. ttest grsearn, by(sex)Two-sample t test with equal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
male 953 25.89192 1.584105 48.90243 22.78318 29.00066female 978 27.92025 1.541657 48.21223 24.8949 30.94559
combined 1,931 26.91921 1.104885 48.5521 24.75232 29.08611
diff -2.028325 2.210045 -6.362652 2.306002
diff = mean(male) - mean(female) t = -0.9178Ho: diff = 0 degrees of freedom = 1929
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0Pr(T < t) = 0.1794 Pr(|T| > |t|) = 0.3589 Pr(T > t) = 0.8206
222
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsComparing means across groups
Two-sample t-test, unequal variance
. ttest grsearn, by(sex) unequalTwo-sample t test with unequal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
male 953 25.89192 1.584105 48.90243 22.78318 29.00066female 978 27.92025 1.541657 48.21223 24.8949 30.94559
combined 1,931 26.91921 1.104885 48.5521 24.75232 29.08611
diff -2.028325 2.210451 -6.363455 2.306805
diff = mean(male) - mean(female) t = -0.9176Ho: diff = 0 Satterthwaite´s degrees of freedom = 1925.9
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0Pr(T < t) = 0.1795 Pr(|T| > |t|) = 0.3589 Pr(T > t) = 0.8205
223
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsComparing means across groups
Key concepts
Hypothesis testNull hypothesisInitial or alternative hypothesis
Types I and II errorStatistical significance and p-valuest-tests:
Single value compared with referencePaired values compared with implicit zero referenceIndependent samples t-test: comparing two groups
Equal vs unequal variance
224
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsSummarising inference
Summarising inference
For large samples, we use the normal distribution to construct confidence intervalsaround means of “quantitative” (interval/ratio) variablesFor small samples we use the t-distribution to construct the confidence intervalFor interval/ratio variables we usually estimate the mean, and use
SE =σ√n
=
√∑(X−X )2
n−1√n
225
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsSummarising inference
Summarising inference: proportions
For nominal variables like vote, sex, etc. we calculate proportions, not means(where we split in two)With large samples (at least 20 in each category) we can construct confidence
intervals using the normal distribution and the formula σπ =√
p(1−p)n for the
standard errorWith this we can conduct hypothesis tests in exactly the same way as withinterval/ratio dataHowever, with small samples the approximation to the normal distribution nolonger holds and we have to use another distribution, the binomial distribution
226
SO5041: Quantitative Research MethodsUnit 14: More $t$-testsSummarising inference
Summarising inference: χ2 test
For nominal variables with more categories, and for tables made from nominalvariables, we can use the χ2 testAgain, this has “large sample” requirements – none of the expected values shouldbe < 5
227
SO5041: Quantitative Research MethodsUnit 15: Correlation
Unit 15: Correlation
228
SO5041: Quantitative Research MethodsUnit 15: Correlation
Remaining topics
From today we look at two related techniques for interval and ratio variables:Correlation: a single measure of association for interval/ratio variablesLinear Regression: a very powerful technique for describing the relationship betweenone interval/ratio variable and another
228
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Scatterplots
Scatterplots can be considered as interval/ratio analogue of cross-tabs: arbitrarilymany values mapped out in 2-dimensions
*Figure 1:* Age 47, income €6,535.
0
2000
4000
6000
8000
10000
0 20 40 60 80 100
Inco
me
Age
47 units from left
6535 unitsfrom bottom
229
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Age vs Income
*Figure 2:* Age and income, Wave 1 BHPS.
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
0 20 40 60 80 100
Inco
me
Age 230
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Summarising association simply
We can see a lot of detail in a scatterplot, but sometimes we can summarise it insimple waysFor instance the two variables may have a positive association: when one is highthe other tends to be high, and vice versaOr a negative association: when one is high the other tends to be low
231
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Strong positive
*Figure 3:* Fictional data displaying a strong *positive* linear relationship.
5
10
15
20
25
30
35
0 1 2 3 4 5 6 7 8 9
Y v
aria
ble
X variable
232
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Strong negative
*Figure 4:* Fictional data displaying a strong *negative* linear relationship.
0
5
10
15
20
25
0 1 2 3 4 5 6 7 8 9 10
Y v
aria
ble
X variable
233
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Weak negative
*Figure 5:* Fictional data displaying a *weak* negative linear relationship.
-10
-5
0
5
10
15
20
25
30
0 1 2 3 4 5 6 7 8 9 10
Y v
aria
ble
X variable
234
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
No relationship
*Figure 6:* Fictional data displaying absence of a relationship.
0
2
4
6
8
10
12
14
16
18
20
0 1 2 3 4 5 6 7 8 9 10
Y v
aria
ble
X variable
0
2
4
6
8
10
0 5 10 15 20
Y v
aria
ble
X variable
235
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
App
http://teaching.sociology.ul.ie:3838/so5041/corr
236
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
The correlation coefficient
How does it work?Combines X deviations (Xi − X ) and Y deviations (Yi − Y ) – i.e., compares eachpoint with the mean for X and the mean for YWith positive association, cases below average on X tend to be below average onY (and above average on X tend to be above average on Y)With negative association, cases below average on X tend to be above average onY and vice versaWith positive association below the mean, both Xi − X and Yi − Y are negative,so (Xi − X )(Yi − Y ) is positiveWith negative association, Xi − X and Yi − Y tend to have opposite signs, so(Xi − X )(Yi − Y ) is negative
237
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Deviations
Figure 7: X and Y deviations for correlation
X − X_
Y − Y_
X
Y
X
Y_
_
238
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Pearson Product-Moment Correlation
CoefficientPearson Product-Moment Correlation Coefficient (r)
r =SXY√
SXX .SYY
SXX = Σ(X − X )2
SYY = Σ(Y − Y )2
SXY = Σ(X − X )(Y − Y )
Range: $ -1 ≤ r ≤ +1 $r is a symmetric measure: rxy = ryx
239
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Pitfalls
Correlation is not causalityAbsence of correlation is not absence of relationship: non-linearity
240
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Non-linear!
*Figure 8: * Fictional data displaying a strong *non-linear* relationship and a near-zerocorrelation.
-2
0
2
4
6
8
10
12
-3 -2 -1 0 1 2 3
t, t**2+invnorm(rand(1))
241
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Estimating correlations in Stata
. pwcorr ofimn ojbhgs ojbhrs, sigofimn ojbhgs ojbhrs
ofimn 1.0000
ojbhgs 0.4851 1.00000.0000
ojbhrs 0.4245 0.2489 1.00000.0000 0.0000
242
SO5041: Quantitative Research MethodsUnit 15: CorrelationMeasures of association for interval/ratio variables
Viewing the same correlations
243
SO5041: Quantitative Research MethodsUnit 16: Regression
Unit 16: Regression
244
SO5041: Quantitative Research MethodsUnit 16: Regression
Regression analysis
Regression Analysis: Fitting the “best” line through the scatterVery closely related to correlation, but treats one variable as dependent and theother(s) as explanatory, while correlation is asymmetric
244
SO5041: Quantitative Research MethodsUnit 16: Regression
Some geometry: equation of a line
Y = a + bX
0
0.5
1
1.5
2
0 0.5 1 1.5 2 2.5 3
Y v
aria
ble
X variable
X = 0; Y = 0.7
X = 1; Y = 1.0
Y = 0.7 + 0.3*X
245
SO5041: Quantitative Research MethodsUnit 16: Regression
Predictive in intent
Asymmetric: use X to predict YPRE: implied causality(Proportional Reduction in Error: if we know X, we can guess Y better)Find the best a and b to summarise the data scatter:
‘Best’ is defined as minimising the squared deviations between the observeddata-points and the fitted line, hence often called ‘least-squares’ regressionDeviations are the vertical distance between the line and the observed data points.Very similar logic to the mean (minimise variance).
246
SO5041: Quantitative Research MethodsUnit 16: Regression
Predicted values
The line gives a predicted value of Y for each value of X :
Y = a + bX
Y = Y + e
Y = a + bX + e
e is the ‘residual’ or deviation.
That is, knowing X we “predict” or guess Y as a + bX
In general this is more accurate that guessing Y as Y , the mean: “ProportionateReduction in Error”
247
SO5041: Quantitative Research MethodsUnit 16: Regression
Deviations from the line
248
SO5041: Quantitative Research MethodsUnit 16: Regression
Regression equation
Regression equation: the estimate of Y , called Y , depends on X :
Y = a + bX
The regression slope b depends on SXY and SXX , the intercept a is calculatedfrom b and the mean values of Y and X :
b =SXY
SXX
a = Y − bX
SXX =∑
(Xi − X )2
SXY =∑
(Xi − X )(Yi − Y )
249
SO5041: Quantitative Research MethodsUnit 16: Regression
Pitfalls
Like correlation, non-linear relationships may be missedSpurious relationships will fit just as well as real ones (e.g., if A affects B and Aaffects C, B and C will seem to be related and a regression line might fit well)Predicting outside the range of the data: the relationship we see only holds for thedata we use, and it may well not hold for higher (or lower) values of X and Y
250
SO5041: Quantitative Research MethodsUnit 16: Regression
Fit
How well does it “fit”? We use R2 to tell:ranges from 0: no relationship at allto 1: perfect relationship, all Y s are exactly equal to a + bXvalues from 0.7 up indicate quite a good relationshipsmaller values may indicate an interesting relationship
In the case of bivariate regression (one independent variable), R2 is the same asr × r (squared correlation coefficient).
251
SO5041: Quantitative Research MethodsUnit 16: Regression
Regression in Stata:
. reg ofimn ojbhrsSource SS df MS Number of obs = 7,945
F(1, 7943) = 1398.95Model 1.7000e+09 1 1.7000e+09 Prob > F = 0.0000
Residual 9.6522e+09 7,943 1215179.2 R-squared = 0.1497Adj R-squared = 0.1496
Total 1.1352e+10 7,944 1429021.17 Root MSE = 1102.4
ofimn Coef. Std. Err. t P>|t| [95% Conf. Interval]
ojbhrs 39.34202 1.051854 37.40 0.000 37.28011 41.40393_cons 434.7389 36.8029 11.81 0.000 362.5955 506.8822
252
SO5041: Quantitative Research MethodsUnit 16: Regression
Predicted regression line
253
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Multiple explanatory variables
Regression analysis can be extended to the case where there is more than oneexplanatory variable – multivariate regressionThis allows us to estimate the net simultaneous effect of many variables, and thusto begin to disentangle more complex relationshipsInterpretation is relatively easy: each variable gets its own slope coefficient,standard error and significanceThe slope coefficient is the effect on the dependent variable of a 1 unit change inthe explanatory variable, while taking account of the other variables
254
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Example
Example: domestic work time may be affected by gender, and also by paid worktime: competing explanations – one or the other, or both could have effectsWe can fit bivariate regressions:
DWT = a + b × PaidWork
orDWT = a + b × Female
We can also fit a single multivariate regression
DWT = a + b × PaidWork + c × Female
255
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Dichotomous variables
We deal with gender in a special way: this is a binary or dichotomous variable –has two valuesWe turn it into a yes/no or 0/1 variable – e.g., female or notIf we put this in as an explanatory variable a one unit change in the explanatoryvariable is the difference between being male and femaleThus the c coefficient we get in the DWT = a + b × PaidWork + c × Femaleregression is the net change in predicted domestic work time for females, once youtake account of paid work time.The b coefficient is then the net effect of a unit change in paid work time, onceyou take gender into account.
256
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Sex only predicting income
. reg ofimn i.osexSource SS df MS Number of obs = 7,945
F(1, 7943) = 606.72Model 805586626 1 805586626 Prob > F = 0.0000
Residual 1.0547e+10 7,943 1327780.12 R-squared = 0.0710Adj R-squared = 0.0708
Total 1.1352e+10 7,944 1429021.17 Root MSE = 1152.3
ofimn Coef. Std. Err. t P>|t| [95% Conf. Interval]
osexfemale -637.3352 25.87467 -24.63 0.000 -688.0563 -586.614
_cons 2062.275 18.64855 110.59 0.000 2025.719 2098.831
257
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Sex and job hours predicting income
. reg ofimn ojbhrs i.osexSource SS df MS Number of obs = 7,945
F(2, 7942) = 794.96Model 1.8935e+09 2 946761687 Prob > F = 0.0000
Residual 9.4586e+09 7,942 1190962.07 R-squared = 0.1668Adj R-squared = 0.1666
Total 1.1352e+10 7,944 1429021.17 Root MSE = 1091.3
ofimn Coef. Std. Err. t P>|t| [95% Conf. Interval]
ojbhrs 33.96065 1.123629 30.22 0.000 31.75804 36.16326
osexfemale -337.0889 26.44232 -12.75 0.000 -388.9228 -285.255
_cons 787.1759 45.73595 17.21 0.000 697.5214 876.8304
258
SO5041: Quantitative Research MethodsUnit 16: RegressionMultiple Regression
Sex and hours
259