Challenges and strategies when exploiting data on ethnicity from social survey datasets
Paul Lambert, University of Stirling
Talk presented to the NCRM seminar ‘What is ethnicity? What methods best capture it?’, part of the NCRM series ‘Promoting methodological innovation and capacity building in research on ethnicity’, University of Essex, 14th May 2010.
This work draws upon materials from the GEMDE project, a component of DAMES (www.dames.org.uk), an ESRC-funded research Node working on ‘Data Management through e-Social Science’
Summary of claims
1) Well-known challenges in exploiting survey measures of ethnicity
..our response is usually too conservative..
2) Better ‘data management’ could/should allow us to get much more from data
o Take account of more precise ethnic differences
o Longitudinal/cross-national comparisons
o Complex multivariate models, interaction effects
3) We have something to offer here: ‘GEMDE’
…why is working with ethnicity data in surveys so hard…?
- It’s sparse
- It’s collinear (e.g. to age)
- It’s dynamic (cf. comparative research)
Data includes: Generic & specialist studies collecting ethnic ‘referents’
‘ethnic identity’; nationality, parents’ nationality; country of birth; language spoken; religion; ‘race’
National research: Most countries have evolving standard definitions of ethnic groups, though not all surveys follow them
• Some surveys cover large numbers from many/all groups
• Most surveys only have sparse representation of most groups
Comparative research (international/longitudinal): Seen as highly problematic in many fields except immigration studies
o Lambert, P.S. (2005). Ethnicity and the Comparative Analysis of Contemporary Survey Data. In J. H. P. Hoffmeyer-Zlotnik & J. Harkness (Eds.), Methodological Aspects in Cross-National Research (pp. 259-277). Mannheim: ZUMA-Nachrichten Spezial 11.
Ethnic group in the World Values Survey - Britain
Count, by wave

Ethnic group                                                1981-1984  1989-1993  1994-1999  1999-2004  Total
-5 Missing; Unknown                                                18          0          0          0     18
-4 Not asked                                                        0       1484          0        999   2483
-1 Don’t know                                                       0          0          1          0      1
40 Asian                                                           15          0          0          0     15
70 Asian - Central (Arabic)                                         1          0          0          0      1
80 Asian - East (Chinese, Japanese)                                 0          0          3          0      3
90 Asian - South (Indian, Hindu, Pakistani, Bangladeshi)            0          0         11          0     11
130 Bangladeshi                                                     0          0          1          0      1
200 Black African                                                   0          0          4          0      4
210 Black-Caribbean                                                 0          0         12          0     12
220 Black-Other / Black                                             9          0          2          0     11
810 Pakistani                                                       0          0          7          0      7
1400 White / Caucasian White                                     1124          0       1044          0   2168
8000 Other                                                          0          0          8          0      8
Total                                                            1167       1484       1093        999   4743
He said that ‘our response is usually too conservative’? I’m not conservative!
Social theory is dynamic, fluid, ‘intersectional’, but representative empirical analysis struggles to engage with its terms
Empirical studies are bivariate; descriptive; use low numbers of groups & normalising assumptions
This is ‘conservative’ because..
o Administrative pressure to reify descriptive groups
o Analyses simplify, or ignore, rather than incorporate, extra information on ethnic locations (e.g. language, religion)
o Analytical results tend to be easily anticipated (basic descriptions, ignoring complex collinear contexts)
2) Data management for categorical data
Principal social survey datum
o Basis of most social research reports/analyses/comparisons
It’s rich and complex
o We’re often interested in very fine levels of detail / difference
o We usually recode categories in some way for analysis
…how categorical data is managed is of great consequence to the results of analysis…
o Choices about recoding, boundaries, contrasts made [e.g. RAE analysis: Lambert & Gayle 2009]
Management itself influences analytical approaches
EFFNATIS sample (1999): Subjective ethnic identity

Q.129 Which of the following best describes how you would describe yourself?

Category                                 Freq.  Percent    Cum.
1. British                                 137    16.65   16.65
2. English                                  79     9.60   26.25
3. Pakistani                                25     3.04   29.28
4. Indian                                   10     1.22   30.50
5. Bangladeshi                              10     1.22   31.71
6. Pakistani-British                        73     8.87   40.58
7. Indian-British                           34     4.13   44.71
8. Bangladeshi-British                      22     2.67   47.39
9. Asian                                     6     0.73   48.12
10. White                                    7     0.85   48.97
11. European                                 6     0.73   49.70
13. White-British                           60     7.29   56.99
15. Asian-British                           65     7.90   64.88
16. Cosmopolitan                             7     0.85   65.74
17. Moslem                                  12     1.46   67.19
18. Indian, Asian (4,9)                      2     0.24   67.44
19. British Moslem                           6     0.73   68.17
21. Indian & British (1,4)                   1     0.12   68.29
22. Indian-British, Asian-British (7,15      9     1.09   69.38
23. English & White (2,10)                  54     6.56   75.94
24. English, White-British (2,13)           41     4.98   80.92
25. White-British & Cosmopolitan (13,16      5     0.61   81.53
26. English & White-British (2,13)           8     0.97   82.50
27. White-Italian                            1     0.12   82.62
28. European and White-British (11,13)       7     0.85   83.48
29. English, European and White-British     20     2.43   85.91
30. English, White, Cosmopolitan (2,10,      1     0.12   86.03
31. English, Indian-British, Asian-Brit      1     0.12   86.15
32. Pakistani-British, Asian-British (6      6     0.73   86.88
33. Pakistani, Asian (3,9)                   3     0.36   87.24
34. Kashmiri                                 4     0.49   87.73
35. Pakistani-Moslem living in Great Br      1     0.12   87.85
36. Pakistani-British, Black-British, A      1     0.12   87.97
37. Pakistani-British, Asian, Black-Bri      1     0.12   88.09
38. Pakistani, Black (3,12)                  1     0.12   88.21
39. Pakistani-born British                   1     0.12   88.34
40. Pakistani-British, Asian-British (6      2     0.24   88.58
41. Black, Asian-British (12,15)             2     0.24   88.82
42. Pakistani-British, Asian (6,9)           6     0.73   89.55
43. English, Indian-British, Asian-Brit      1     0.12   89.67
44. English, Pakistani-British, Indian-      2     0.24   89.91
45. Neapolitan                               1     0.12   90.04
46. Indian, Asian, Asian-British (4,9,1      1     0.12   90.16
47. Indian-British, Asian (7,9)              5     0.61   90.77
48. English, Indian-British, Black, Asi      2     0.24   91.01
49. English, Indian-British (2,7)            2     0.24   91.25
50. English, Asian-British (2,15)            2     0.24   91.49
51. Pakistani-British, Indian-British (      1     0.12   91.62
52. English, Indian, Asian-British (2,4      1     0.12   91.74
53. British, English (1,2)                   9     1.09   92.83
54. British, White (1,10)                    2     0.24   93.07
55. Scottish                                 1     0.12   93.20
56. English, European (2,11)                 2     0.24   93.44
58. Individual                               2     0.24   93.68
59. English, European, White-British, C      3     0.36   94.05
60. British, White, European, Cosmopoli      1     0.12   94.17
61. English, Pakistani-British (2,6)         2     0.24   94.41
62. Pakistani-British, Black, Asian-Bri      2     0.24   94.65
63. Bangladeshi, Asian, Black (5,9,12)       1     0.12   94.78
64. British, European (1,11)                 2     0.24   95.02
65. Lancastrian                              3     0.36   95.38
66. Humanoid                                 1     0.12   95.50
67. Pakistani, Asian-British (3,15)          1     0.12   95.63
68. Pakistani, Asian-British, Moslem, F      1     0.12   95.75
69. English, White, Irish                    1     0.12   95.87
70. English, White, European (2,10,11)       3     0.36   96.23
71. Irish-English                            1     0.12   96.35
72. English, White-British, Cosmopolita      2     0.24   96.60
73. British with Baltic-Slav origins         1     0.12   96.72
77. English, Pakistani (2,3)                 3     0.36   97.08
78. Pakistani-British, Asian, Moslem         1     0.12   97.21
79. English, Indian-British, White, Asi      1     0.12   97.33
80. Human being                              2     0.24   97.57
82. English, Pakistani-British, Asian (      1     0.12   97.69
83. Pakistani-British, Italian               1     0.12   97.81
85. European, White-British, Irish           1     0.12   97.93
86. English, Indian (2,4)                    4     0.49   98.42
87. Indian-British, Black-British (7,14      1     0.12   98.54
88. Indian, Black, Asian-British (4,12,      1     0.12   98.66
89. Indian, Asian-British (4,15)             1     0.12   98.78
90. White-British, Ukrainian                 1     0.12   98.91
UK EFFNATIS survey (1999)
[Heckmann et al 2001];
[Penn & Lambert 2009]
Favourite type of music, by ethnicity

music1b                 1. Autoch 2. Pakist 3. Indian 4. Bangla  Total
1. Pop                        107        32        20         4    163
2. Chart                       27         2         0         0     29
3. Dance                       94         5         9         1    109
4. Soul                         8        31        10         3     52
5. R'n'B                       19        16         5         5     45
6. Hip Hop                      5         6         2         0     13
7. Asian Music                  0        11        12         2     25
8. Trance                       0         0         1         0      1
9. House                       18         0         1         0     19
10. Techno                      1         0         0         0      1
11. Garage                      0         2         1         0      3
12. Hardcore                    3         0         0         0      3
13. Acid Jazz                   1         0         0         0      1
14. Rap                         4        11         2         1     18
15. Indie / Britpop            46         4         3         1     54
16. Rock / Heavy Meta          32         2         2         0     36
17. Motown                      1         0         0         0      1
18. Folk                        1         0         0         0      1
20. Jazz                        2         1         1         0      4
21. Classical                   4         2         1         1      8
22. Sixties                     4         0         0         0      4
23. Seventies                   3         0         0         0      3
24. Eighties                    2         0         0         0      2
26. Grunge                      1         0         0         0      1
27. Easy Listening              3         0         0         0      3
28. Reggae                      0         3         2         1      6
30. Country                     2         0         0         0      2
31. Punk                        3         0         0         0      3
35. Indian                      0        11         3         1     15
36. Bangra                      0         5         2         0      7
37. Jungle                      1         0         0         0      1
38. Asian Pop Music             0         2         0         0      2
40. All                        19         5         2         0     26
41. Drum 'n' Bass               3         0         0         0      3
43. Alternative                 2         0         0         0      2
44. Soft Rock                   2         1         0         0      3
45. Rock 'n' Roll               1         0         0         0      1
46. Love songs / ball           0         2         0         0      2
47. Blues                       1         0         0         0      1
48. Swing                       0         1         0         0      1
50. Break Beat                  1         1         0         0      2
51. Goth                        1         0         0         0      1
60. Dance Floor                 2         0         0         0      2
63. English                     0         2         0         0      2
64. Classical Hindi             0         1         1         0      2
65. Hindi                       0         1         2         0      3
68. Urdu                        0         2         0         0      2
72. Lo-fi                       1         0         0         0      1
75. Ambient                     1         0         0         0      1
77. Bollywood                   0         0         1         0      1
78. Dramatic                    0         0         4         0      4
79. Islamic                     0         0         1         0      1

uketh2                  1. Ethnic  2. No EM  Total
1. Autochthonous                1       426    427
2. Pakistani                   58       105    163
3. Indian                      32        56     88
4. Bangladeshi                  6        14     20
Total                          97       601    698
A ‘data management’ contribution?
o Preserve information on what was done with categorical data
o Communicate information on what should/could be done
Standardizing categorical data
‘Standardization’ refers to treating variables for the purposes of analysis, in order to aid comparison between variables
o {In the terminology of survey research analysts}
1. Arithmetic standardization to re-scale metric values [$z_i = (x_i - \bar{x}) / sd$]
2. Ex-ante or Ex-post harmonisation [during data production, or adaptation after the event]
3. Measurement or Meaning/Functional equivalence [Much comparative research flounders on the apparent impossibility of measurement equivalence and lack of options for functional equivalence, e.g. Van Deth, 2003]
‘One size doesn’t fit all so we can’t go on’
Meaning equivalence
For categorical data, equivalence for comparisons is often best approached in terms of meaning equivalence
(because of non-linear relations between categories and shifting underlying distributions)
(even if measurement equivalence seems possible)
Arithmetic standardisation offers a convenient form of meaning equivalence by indicating relative position within the structure defined by the current context
For categorical data, this can be achieved/approximated by scaling categories in one or more dimension of difference
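As a minimal sketch of the arithmetic standardisation described above, the Python below re-scales values separately within each context (for example each survey, country or year), so that a score indicates relative position within that context's own distribution. The function name, the context labels and the numbers are illustrative assumptions, not material from the talk.

```python
from statistics import mean, pstdev


def standardise_within(values_by_context):
    """Return z-scores, z_i = (x_i - mean) / sd, computed separately per context."""
    z_scores = {}
    for context, values in values_by_context.items():
        centre = mean(values)
        spread = pstdev(values)  # population sd; a sample sd would also be defensible
        if spread == 0:
            raise ValueError(f"no variation in context {context!r}")
        z_scores[context] = [(v - centre) / spread for v in values]
    return z_scores


# Hypothetical measurements recorded on two differently scaled instruments.
raw = {
    "survey_a": [10.0, 20.0, 30.0],
    "survey_b": [100.0, 200.0, 300.0],
}
z = standardise_within(raw)
```

After standardisation the two hypothetical surveys yield matching scores, because each value is expressed relative to its own context rather than on the original metric.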
[Figure: ‘Effect proportional scaling’ using parents’ occupational advantage.
Categories plotted by father's Cambridge Scale score (vertical axis, roughly 20 to 50) for three classifications:
- Occupation: Managers and Administrators; Professional; Associate professional and technical; Clerical and secretarial; Craft and related; Personal and protective services; Sales; Plant and machine operatives; Other occupations
- Education: higher degree; first degree; teaching qf; other higher qf; nursing qf; gce a levels; gce o levels or equiv; commercial qf, no o levels; cse grade 2-5, scot grade 4-5; apprenticeship; other qf; no qf
- Ethnic group: white; black-carib; black-african; black-other; indian; pakistani; bangladeshi; chinese; other ethnic grp
Points at 1-3 show category mean. Points at 0 show individual values (scaled mean=28, sd=6; pop. mean=28, sd=18).
Source: British Household Panel Survey 2007, adults aged 18+ and father's Cambridge Scale score.]
What was that then?
We can represent categories through positions on a scale
In turn, we can use position in the dimension as a category score which then plugs into a further analysis (e.g. regression main and interaction effects)
..Some options for data on ethnicity..
o Stereotyped Ordered Logistic Regression (SOR) models summarize dimensions of difference according to regression predictor values [e.g. Lambert and Penn, 2001]
o Geometric data analysis for distances between people, or things [cf. Prandy, 1979; Bennett et al., 2009]
o Assign category scores by hand (a priori or by selected average)
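The last option, scoring categories by a selected average, can be sketched in a few lines of Python. This is only an illustration under assumed names and invented data: each category is given the mean of an external variable among its members, and that score is then attached to every record as a metric variable for later modelling.

```python
from collections import defaultdict
from statistics import mean


def category_scores(records, category_key, value_key):
    """Score each category by the mean of an external variable among its members."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[category_key]].append(record[value_key])
    return {category: mean(values) for category, values in grouped.items()}


# Hypothetical respondents: a group code plus a parental occupational score.
respondents = [
    {"group": "A", "parent_score": 20.0},
    {"group": "A", "parent_score": 30.0},
    {"group": "B", "parent_score": 40.0},
    {"group": "B", "parent_score": 50.0},
    {"group": "C", "parent_score": 28.0},
]

scores = category_scores(respondents, "group", "parent_score")

# Attach the score as a metric variable that a later regression could use
# in place of (or alongside) a set of category dummies.
scored = [dict(r, group_score=scores[r["group"]]) for r in respondents]
```

A single score per category uses one degree of freedom where a full set of dummies would use one per category, which is the parsimony argument made for scaling in this talk; sparse categories (such as group "C" here, with one member) still yield unstable scores.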
[Figure: Correspondence analysis dimension scores, LFS pooled data for men, 1991-2005.
Horizontal axis: Dimension 1 (58.4%); vertical axis: Dimension 2 (22.1%).
Points plotted for ethnic group by migration cohort (WhB, WhO, WhI, Ind, PkB, BA, BC, Chi, Oth, each with suffix a, b or c) and for occupational class (I/II, IIIa, IVabc, V/VI, VII/IIIb, Unemployed, Inactive).
a = Born in UK; b = Came to UK before 1970; c = Came to UK 1970 or later.
N=640295 (Data: Li and Heath, 2008)]
              UK born   Not UK born
Whi Iri       0 (base)
Whi Other        -5.4          -3.4
Black Car.       -9.6          -1.0
Black Afr       -12.7          -5.9
Indian          -16.1          -1.0
Pak/Bang.       -16.8          -3.2
Chinese         -15.8          -6.5
Other           -11.0          -3.9
Stereotype logistic regression                  Number of obs   =      21999
                                                Wald chi2(6)    =       7.69
Log likelihood = -47960.017                     Prob > chi2     =     0.2616

 ( 1)  [phi1_1]_cons = 1

ethim2         Coef.   Std. Err.       z   P>|z|    [95% Conf. Interval]
yob        -.0298194    .0107592   -2.77   0.006   -.0509071   -.0087317
educ6_1    -.0771515    .0341444   -2.26   0.024   -.1440733   -.0102298
educ6_3    -.4110193    .1492615   -2.75   0.006   -.7035665   -.1184722
educ6_4    -.2971853    .1083325   -2.74   0.006   -.5095131   -.0848574
educ6_5    -.3082588    .1122945   -2.75   0.006    -.528352   -.0881655
educ6_6    -.2346746    .0869549   -2.70   0.007   -.4051031   -.0642461
Is scaling useful?
..sometimes..
Intrinsically revealing as an exploratory exercise
Parsimonious functional form in explanatory modelling
o Esp. if ethnicity is a control variable
o If interaction effects are considered
o If a story of a linear functional form is persuasive (e.g. exponential increase)
Predicting poor subjective health, BHPS w15
Variable               m1         m2        m2b         m3         m5        m5b         m4
fem               .075***    .077***    .077***    .079***    .075***    .075***    .077***
oage             .0062***   .0063***   .0062***   .0063***   .0062***    .011***   .0063***
educ4_1           -.24***    -.25***    -.25***    -.25***    -.25***    -.25***    -.25***
educ4_2            -.1***     -.1***     -.1***     -.1***     -.1***     -.1***     -.1***
educ4_4            .31***      .3***      .3***      .3***      .3***      .3***      .3***
_Irace_2                         .29       -.42
_Irace_3                        .023        .14
_Irace_4                         .18        -.6
_Irace_5                       .31**       .064
_Irace_6                         .14       -.54
_Irace_7                       .83**       -1.1
_Irace_8                        -.06        .18
_Irace_9                       .35**        .35
_IracXoage_2                               .016
_IracXoage_3                             -.0029
_IracXoage_4                               .022
_IracXoage_5                              .0057
_IracXoage_6                               .017
_IracXoage_7                             .057**
_IracXoage_8                             -.0054
_IracXoage_9                           -5.2e-05
_Irace2_2                                              .19
_Irace2_3                                           .26***
ethscore                                                      -.018**        .03
ethsa                                                                   -.0011**
_Iethim_12                                                                            .0048
_Iethim_13                                                                             .074
_Iethim_14                                                                             .044
_Iethim_15                                                                              .29
_Iethim_16                                                                            -.066
_Iethim_17                                                                              .44
_Iethim_18                                                                            -.055
_Iethim_19                                                                              .34
_Iethim_21                                                                            -.057
_Iethim_22                                                                            .66**
_Iethim_23                                                                            -.023
_Iethim_24                                                                              .65
_Iethim_25                                                                            .32**
_Iethim_26                                                                              .27
_Iethim_27                                                                            .96**
_Iethim_28                                                                            -.064
_Iethim_29                                                                             .36*
_cons              1.9***     1.9***     1.9***     1.9***     1.9***     1.7***     1.9***
N                   12294      12294      12294      12239      12294      12294      12294
r2                   .071       .073       .074       .073       .072       .072       .074
ll                 -16154     -16139     -16131     -16072     -16148     -16143     -16134
bic                 32364      32409      32468      32219      32363      32362      32485
legend: * p<0.05; ** p<0.01; *** p<0.001
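The `ll` and `bic` rows of the table can be related to each other with a short calculation. The sketch below assumes the usual definition BIC = k ln(N) - 2 ll. The parameter counts are our own tally from the listed coefficients (an assumption: five covariates plus a constant for m1, and eight further ethnicity dummies for m2), and under that assumption the results land within rounding of the printed values.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(N) - 2 * ll."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# ll and N as printed in the table (ll is rounded there); the parameter
# counts are an assumption based on counting the listed coefficients.
bic_m1 = bic(-16154, 6, 12294)
bic_m2 = bic(-16139, 14, 12294)

# Positive when the extra parameters cost more than the fit they buy.
bic_change = bic_m2 - bic_m1
```

On these figures m2 has the higher log-likelihood but also the higher BIC, so the eight extra dummies are penalised more than they improve fit, whereas the single-score models (m5, m5b) show BIC values close to m1.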
What we do and what we ought to do
Research applications tend to select a single simplifying collinear categorisation of a concept
o Due to coordinated instructions [e.g. Blossfeld et al. 2006]
o Due to perceived lack of available alternatives
o Due to perceived convenience
To make statistical analyses more robust we should…
o Operationalise and deploy various scalings and arithmetic measures
o Try out various categorisations and explore their distributional properties
o … and keep a replicable trail of all these activities..
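Trying out several categorisations need not be laborious. The following Python sketch, with invented codes and two hypothetical groupings, recodes the same detailed variable under each scheme, tabulates the resulting distributions and flags categories that fall below a chosen minimum count.

```python
from collections import Counter

# Hypothetical detailed codes and two alternative groupings of them.
detailed = ["indian", "pakistani", "bangladeshi", "white", "white", "chinese", "white"]

schemes = {
    "two_group": {
        "white": "white", "indian": "minority", "pakistani": "minority",
        "bangladeshi": "minority", "chinese": "minority",
    },
    "three_group": {
        "white": "white", "indian": "south_asian", "pakistani": "south_asian",
        "bangladeshi": "south_asian", "chinese": "other",
    },
}


def tabulate(values, mapping):
    """Recode values with a mapping and count the resulting categories."""
    return Counter(mapping[v] for v in values)


def sparse_categories(counts, minimum):
    """List categories whose count falls below a chosen threshold."""
    return sorted(c for c, n in counts.items() if n < minimum)


distributions = {name: tabulate(detailed, mapping) for name, mapping in schemes.items()}
flags = {name: sparse_categories(counts, minimum=2) for name, counts in distributions.items()}
```

Keeping the schemes in one named structure means the same loop can be re-run whenever a scheme is added, which is one simple way of leaving a replicable trail.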
3) Some contributions from DAMES
3 themes in DAMES ought, in our perspective, to help here
1) Replicability / transparency
2) Plurality of approaches
3) Ease of access (to off-putting operations)
Replicability / transparency
o Document your own recodes
o Access somebody else’s recodes
o Identify commonly used recodes (& use them..!)
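One lightweight way to document a recode so that somebody else can access it is to hold it as data rather than leave it buried in analysis syntax. The Python sketch below is an assumption-laden illustration: the field names, the identifier and the mapping are hypothetical and do not describe the GEMDE deposit format.

```python
import json

# A hypothetical recode, stored as data rather than buried in analysis syntax.
recode = {
    "name": "example_ethnic_3",  # illustrative identifier, not a GEMDE resource
    "source_variable": "ethnic_detailed",
    "target_variable": "ethnic_3",
    "mapping": {"1": "white", "2": "south_asian", "3": "south_asian", "4": "other"},
    "notes": "Illustrative only.",
}


def apply_recode(values, spec):
    """Apply a documented recode; unmapped codes become None so they stay visible."""
    mapping = spec["mapping"]
    return [mapping.get(str(v)) for v in values]


# Serialise the specification so it can be archived or passed to a colleague,
# then reload it to confirm that the shared copy reproduces the same recode.
shared_text = json.dumps(recode, indent=2, sort_keys=True)
reloaded = json.loads(shared_text)

original_result = apply_recode([1, 2, 3, 4, 9], recode)
reloaded_result = apply_recode([1, 2, 3, 4, 9], reloaded)
```

Returning `None` for unmapped codes, rather than dropping them silently, keeps unexpected source values visible to whoever reuses the recode.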
Plurality of approaches
Diminishing excuses for not trying out multiple operationalisations…
[Figure 12: R-squared and BIC for father's ISEI. Explanatory power of schemes in predicting father's ISEI.
Two panels, Britain and Sweden; for each occupation-based scheme the graph shows the increase in R-squared and the increase in BIC.
Schemes shown: ES5, ES2, E9, E6, E5, E3, E2, G13, G11, G10, G7, G5, G3, G2, K4, R7, WR, WR9, O17, O8, O4, MN, I9, I99, CM, CF, CM2, CF2, CG, ISEI, SIOP, AWM, WG1, WG2, WG3, GN1.
Source: BHPS and LNU 1991, adults aged 23-55 in work in 1991, N=4536 Britain, 2504 Sweden.
Model 1: ISEI = linear age + gender; Model 2: ISEI = (Model 1) + occupation-based social classification.
Graph shows improvement in R2 for OLS regression, Model 2 vs Model 1, plus scaled BIC statistic ((Model 2 BIC - Model 1 BIC) / Model 1 BIC). Unweighted data.]
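The two statistics plotted in this figure are simple to compute once each model's R-squared and BIC are in hand. The Python sketch below assumes the scaled BIC statistic is (Model 2 BIC - Model 1 BIC) / Model 1 BIC, as the figure note describes; the scheme names and fit statistics are invented for illustration and are not values read from the graph.

```python
def compare_to_baseline(baseline, alternatives):
    """Return (increase in R-squared, scaled BIC change) for each alternative model.

    Scaled BIC change is (BIC_model2 - BIC_model1) / BIC_model1.
    """
    base_r2, base_bic = baseline
    return {
        name: (r2 - base_r2, (bic - base_bic) / base_bic)
        for name, (r2, bic) in alternatives.items()
    }


# Hypothetical fit statistics, as (R-squared, BIC); not values from the figure.
model_1 = (0.02, 40000.0)
model_2_by_scheme = {
    "scheme_x": (0.12, 39600.0),
    "scheme_y": (0.05, 39900.0),
}

summary = compare_to_baseline(model_1, model_2_by_scheme)
```

Dividing by the Model 1 BIC puts the change on a relative scale, which is what allows panels with different sample sizes, such as Britain and Sweden here, to be read side by side.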
Making complex things easier
Organising complex categorical data
Labelling, recoding, etc
Effect proportional scaling
Standardisation
Interaction terms
[Figure: SOR model dimension scores for BHPS ethnic groups. Identified principally by age, gender attitudes and household income.
Groups plotted on a score axis running from about -2 to 2: 1. White; 2. Mixed; 3. Indian; 4. Pakistani; 5. Bangladeshi; 6. Other Asian; 7. Black-Caribbean; 8. Black African; 9. Other Black; 10. Chinese; 11. Other ethnic group.
Source: BHPS wave 17, n = 12626, % 'White' = 97.3]
GESDE: Grid Enabled Specialist Data Environments
Facilities for collecting together, and distributing, specialist data resources
o Occupations: GEODE project began 2005
o Education and Ethnicity: GEEDE and GEMDE began Feb. 2008
Capacity building aims: improving use of measures of these concepts by
o improving access to relevant information
o providing training / advice on good practice
GEODE: Organising and distributing specialist data resources (on occupations)
The GEODE model for GEMDE?
o Occupational Information Resources
o Occupational Unit Groups
Our approach to GEMDE
….A service for MUGs and MIRs…
o Define/register ‘Minority Unit Groups’
o Define/register ‘Minority Information Resources’
o Explore data resources and obtain help in approaching analysis of complex, sparse data
What's a MIR? 'Minority Information Resource'.
o This is our own terminology. By a MIR, we mean any piece of information which supplies systematic data on a minority unit group (MUG) classification. We've used this term to be deliberately similar to the phrase 'Occupational Information Resources' that we used on GEODE.
o E.g. summary statistical data about the categories from and documentation or information
o E.g. recodings which have been used in a particular study
o Social scientists are not in general aware of the existence of MIRs (cf. widespread use of popular Occupational Information Resources). In GEMDE we seek to publicise little-known resources and promote their uptake: we argue that better communication and dissemination of MIRs is in fact an important step towards better scientific practice of replication and standardisation of research.
In our terms, every MIR necessarily links to a MUG (but not every MUG has a MIR).
The GEMDE prototype
‘Liferay portal’ with access to MUGs and MIRs
Current facilities
o Shibboleth access
o Deposit MUGs/MIRs
o Search/browse deposited resources
o Feedback on resources (user ratings)
Still to come
o Additional guest access
o Review live data (e.g. pooled LFS records)
o Expert and user quality ratings
=> …development over 2010...
Data used
Department for Education and Employment. (1997). Family and Working Lives Survey, 1994-1995 [computer file]. Colchester, Essex: UK Data Archive [distributor], SN: 3704.
Heckmann, F., Penn, R. D., & Schnapper, D. (Eds.). (2001). Effectiveness of National Integration Strategies Towards Second Generation Migrant Youth in a Comparative Perspective - EFFNATIS. Bamberg: European Forum for Migration Studies, University of Bamberg.
Inglehart, R. (2000). World Values Surveys and European Values Surveys 1981-4, 1990-3, 1995-7 [Computer file] (Vol. 2000). Ann Arbor, MI: Institute for Social Research [Producer]; Inter-university Consortium for Political and Social Research [Distributor].
Li, Y., & Heath, A. F. (2008). Socio-Economic Position and Political Support of Black and Ethnic Minority Groups in the United Kingdom, 1972-2005 [computer file]. 2nd Edition. Colchester, Essex: UK Data Archive [distributor], SN: 5666.
University of Essex, & Institute for Social and Economic Research. (2009). British Household Panel Survey: Waves 1-17, 1991-2008 [computer file], 5th Edition. Colchester, Essex: UK Data Archive [distributor], March 2009, SN 5151.
References
Bennett, T., Savage, M., Silva, E. B., Warde, A., Gayo-Cal, M., Wright, D., et al. (2009). Culture, Class, Distinction. London: Routledge.
Blossfeld, H. P., Mills, M., & Bernardi, F. (Eds.). (2006). Globalization, Uncertainty and Men's Careers: An International Comparison. Cheltenham: Edward Elgar.
Lambert, P. S., & Gayle, V. (2009). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. Stirling: University of Stirling, Technical paper 2008-3 of the Data Management through e-Social Science research Node (www.dames.org.uk).
Lambert, P. S., & Penn, R. D. (2001). SOR models and Ethnicity data in LIS and LES: Country by Country Report. Syracuse University, Syracuse, New York 13244-1020: Luxembourg Income Study Paper No. 260, Maxwell School of Citizenship and Public Affairs.
Penn, R. D., & Lambert, P. S. (2009). Children of International Migrants in Europe: Comparative Perspectives. Basingstoke: Palgrave.
Prandy, K. (1979). Ethnic discrimination in employment and housing. Ethnic and Racial Studies, 2(1), 66-79.
Simpson, L., & Akinwale, B. (2006). Quantifying Stability and Change in Ethnic Group. Manchester: University of Manchester, CCSR Working Paper 2006-05.
van Deth, J. W. (2003). Using Published Survey Data. In J. A. Harkness, F. J. R. van de Vijver & P. P. Mohler (Eds.), Cross-Cultural Survey Methods (pp. 329-346). New York: Wiley.