Review and consultation: Next steps in supporting data on ethnicity

DAMES workshop on ‘Data on ethnicity in social survey research’, 28th January 2010, University of Stirling


Some preliminary comments:
i. E-Social Science
ii. Challenges/principles
iii. Ethnicity research agendas

Further comments/discussions/questions


i) What makes this ‘e-Social Science’?

Attention to data management in the context of:
• Standards setting
• Metadata
• Portal framework

Liferay portal to various DAMES resources

iRODS system for ‘GE*DE’ specialist data

Controlled data access under security limits

Use of workflows


‘Data Management’

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […the DAMES Node..]

• Usually performed by social scientists (post-release)
• Most overt in quantitative survey data analysis
• Usually a substantial component of the work process

Here we differentiate data management from archiving / controlling the data itself


‘Data Management through e-Social Science’

DAMES – www.dames.org.uk

ESRC Node funded 2008-2011

Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources
• Engage with existing provisions (e.g. ESDS; CESSDA)

Programme of case studies and provisions – more later


‘The significance of data management for social survey research’

Data management is a major component of the social survey research workload

Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories; dealing with missing records

Post-release manipulations performed by researchers
• Re-coding measures into simple categories
• All serious researchers perform extended post-release management (and have the scars to show for it)

We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently

So the ‘significance’ of DM is about how much better research might be if we did things more effectively…


Data Management through e-Social Science www.dames.org.uk

1.1) Grid Enabled Specialist Data Environments (‘GE*DE’)
1.2) Data resources for micro-simulation on social care data
1.3) Linking e-Health and social science databases
1.4) Training and interfaces for management of complex survey data

2.1) Description, discovery & service use through metadata and data abstraction
2.2) Techniques to handle data from multiple sources
2.3) Workflow modelling for social science
2.4) Security driven data management


E.g. of GEODE: Organising and distributing specialist data resources (on occupations)


Challenges/principles

Data manipulation skills and inertia

I would speculate that around 80% of applications using key variables don’t consult literature and evaluate alternative measures, but choose the first convenient and/or accessible variable in the dataset

Data supply decisions (‘what is on the archive version’) are critical

Much of the explanation lies with lack of confidence in data manipulation / linking data

Too many under-used resources – cf. www.esds.ac.uk


Software issues

Stata seems to be the superior package for secondary survey data analysis:

o Advanced data management and data analysis functionality
o Supports easy evaluation of alternative measures (e.g. est store)
o Culture of transparency of programming/data manipulation

Problems…
o Not available to all users
o Not easily incorporated in generic services


Variables and functional form

Functional form = the way in which measures are arithmetically incorporated in quantitative analysis

With occupations, education, ethnicity, and elsewhere, we tend to be too willing to make simplifying categorisations

o Multiple categorisations are possible
o As are scaling approaches – better suited for complex analytical procedures
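As a rough illustration of what ‘functional form’ means in practice, the sketch below codes the same measure three ways: as a dichotomy, as a set of indicator (dummy) variables, and as a single scale score. It is written in Python rather than SPSS/Stata purely for illustration; the category labels and the scores are hypothetical and are not taken from any published scheme.

```python
# Hypothetical categories and scale scores, for illustration only.
CATEGORIES = ["none", "school", "degree"]
SCALE_SCORES = {"none": 20.0, "school": 45.0, "degree": 70.0}


def as_dichotomy(value, high=("degree",)):
    """Simplest functional form: 1 if the category is in `high`, else 0."""
    return int(value in high)


def as_dummies(value, categories=CATEGORIES, reference="none"):
    """Categorical form: one 0/1 indicator per non-reference category."""
    return {f"is_{c}": int(value == c) for c in categories if c != reference}


def as_score(value, scores=SCALE_SCORES):
    """Scaled form: a single metric value assigned to each category."""
    return scores[value]


records = ["school", "degree", "none", "degree"]
dichotomy = [as_dichotomy(v) for v in records]
dummies = [as_dummies(v) for v in records]
scores = [as_score(v) for v in records]
```

Each version would enter a model differently: the dichotomy and the dummies as categorical contrasts, the score as a single metric term.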


Good habits: Keep clear records of DM activities

• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

Syntax Examples: www.longitudinal.stir.ac.uk


Principle: Use existing standards and previous research

Variable operationalisations

Use recognised recodes / standard classifications
• NSI harmonisation standards (e.g. ONS)
• Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003; Harkness et al. 2005; Jowell et al. 2007]
• Research reviews [e.g. Shaw et al. 2007]
• Common vs. best practices (e.g. dichotomisations)

Use reproducible recodes / classifications (paper trail)

Other data file manipulations
• Missing data treatments
• Matching data files (finding the right data)
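One possible shape for a reproducible recode with a paper trail is sketched below, in Python for illustration rather than as SPSS/Stata syntax. The source codes, target categories and missing-value codes are all hypothetical; a real recode would cite the published standard it follows. The sketch stops on any unmapped code instead of silently dropping it.

```python
from collections import Counter

# Hypothetical source codes -> target categories (not a real classification).
RECODE_MAP = {1: "group_a", 2: "group_a", 3: "group_b", 4: "group_c"}
# Hypothetical codes the data supplier uses for missing responses.
MISSING_CODES = {-9, -8}


def recode(values, mapping=RECODE_MAP, missing=MISSING_CODES):
    """Return the recoded values plus a count of every old -> new pairing."""
    recoded, trail = [], Counter()
    for v in values:
        if v in missing:
            new = None  # explicit missing-data treatment
        elif v in mapping:
            new = mapping[v]
        else:
            # Refuse to guess: an unmapped code is a documentation gap.
            raise ValueError(f"unmapped source code: {v!r}")
        recoded.append(new)
        trail[(v, new)] += 1
    return recoded, trail


raw = [1, 2, 3, -9, 4, 1]
recoded, trail = recode(raw)

# The trail is the record to keep alongside the analysis file.
for (old, new), n in sorted(trail.items(), key=lambda item: item[0][0]):
    print(f"{old} -> {new}: {n} case(s)")
```

Keeping the mapping as a named table, not as scattered if-statements, means the recode itself can be cited and reused.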


Principle: Do something, not nothing

We currently put much more effort into data collection and data analysis, and neglect data manipulation

Survey research – the influence of ‘what was on the archive version’

…In my experience, a common reason people didn’t do more DM was that they were frightened to…


Principle: Support linking data

Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...

One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge 1:1 pid using file2.dta

One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge m:1 pid using file2.dta

Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)

Many-to-Many matches

Related cases matching
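The first three matching types above can be sketched in plain Python, as an illustration of the logic rather than a replacement for the SPSS/Stata commands. All record layouts and the names `pid`, `hid`, `score`, `region` and `meaninc` are hypothetical. Unlike a full merge, this sketch keeps only the cases present in the first file.

```python
from collections import defaultdict
from statistics import mean


def index_unique(rows, key):
    """Index rows by `key`, refusing duplicate keys."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate key: {row[key]!r}")
        index[row[key]] = row
    return index


def match_one_to_one(left, right, key):
    """One-to-one: each key appears at most once in both files."""
    left_index = index_unique(left, key)
    right_index = index_unique(right, key)
    return [{**left_index[k], **right_index.get(k, {})} for k in left_index]


def distribute_table(cases, table, key):
    """Table distribution: copy each table row onto every case sharing its key."""
    lookup = index_unique(table, key)
    return [{**case, **lookup.get(case[key], {})} for case in cases]


def aggregate_mean(cases, key, var, out):
    """Aggregation: collapse many cases per key into one mean per key."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case[var])
    return [{key: k, out: mean(v)} for k, v in sorted(groups.items())]


# Hypothetical example files.
persons = [
    {"pid": 1, "hid": 10, "age": 34},
    {"pid": 2, "hid": 10, "age": 36},
    {"pid": 3, "hid": 11, "age": 51},
]
attitudes = [{"pid": 1, "score": 4}, {"pid": 2, "score": 2}, {"pid": 3, "score": 5}]
households = [{"hid": 10, "region": "north"}, {"hid": 11, "region": "south"}]
waves = [
    {"pid": 1, "income": 100},
    {"pid": 1, "income": 300},
    {"pid": 2, "income": 250},
]

one_to_one = match_one_to_one(persons, attitudes, "pid")
one_to_many = distribute_table(persons, households, "hid")
many_to_one = aggregate_mean(waves, "pid", "income", "meaninc")
```

The duplicate-key check is a deliberate choice: a key that is not unique in the file where it should be is usually a sign that the wrong matching type has been chosen.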


Challenges..

Agreeing about variable constructions

Unresolved debates about optimal measures and variables

Esp. in comparative research such as across time, between countries

In DAMES, we have particular interests in comparability:
• Longitudinal comparability (http://www.longitudinal.stir.ac.uk/variables/)
• Scaling / scoring categories to achieve ‘meaning equivalence’ or ‘specific measures’


Challenges..

Incentivising documentation / replicability

There is little to press researchers to better document DM, but much to press them not to

• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?

iii) Ethnicity research agendas

Our impression
o More data on more referents
o Controlled access to data
o Increasing recognition of intergenerational change
o Mixed identities

Other views…?


Further comments/ discussion/ questions

…..
