(Policy) research with confidential micro data
Eric J. Bartelsman
Vrije Universiteit Amsterdam
Tinbergen Institute
Expertenworkshop Ondernemingsdata in België
Brussels, September 25 2009
Overview
• Benefits of using linked longitudinal firm-level datasets
• International experience
• Modes of access to confidential firm-level datasets
Benefits of using firm-level data
• Improving quality of statistics
• Testing of theories at firm-level
• Providing ‘moments’ for modelling
• Policy evaluation
Benefits of using firm-level data
• Improving quality of statistics
  • Assessing quality of published stats
  • New uses for old data
  • Uncovering new collection methods and new data needs
• Testing of theories at firm-level
• Providing ‘moments’ for modelling
• Policy evaluation
Data Quality
• In-house use at National Stats Office (NSO):
  • Consistency in cross-section and longitudinal dimensions
  • Integration: top-down vs bottom-up
• External users:
  • Quality improvement criteria
  • Systematic learning from external users
New uses for ‘old’ data
• Linking of multiple sources
  • Link NSO surveys to Business Register
  • Cross-linking with other registers
    • Housing, transport, labor, tax
  • Linking with external surveys
• Creation of new indicators from linked data
  • Gross flows
  • Higher moments; correlations
  • New disaggregations
    • Subsamples: region, industry, size, type
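Gross flows are an indicator that needs linked longitudinal records, because each firm's change must be measured before summing. A minimal sketch, assuming a hypothetical panel of (firm_id, year, employment) tuples; the function name and the data are illustrative and not taken from any NSO system:

```python
from collections import defaultdict

def gross_job_flows(records, year):
    """Return (job_creation, job_destruction, net_change) from year-1 to year.

    records: iterable of (firm_id, year, employment) tuples (hypothetical layout).
    A firm missing in a year is treated as having zero employment, so
    entry counts as creation and exit counts as destruction.
    """
    emp = defaultdict(dict)
    for firm_id, yr, n in records:
        emp[firm_id][yr] = n

    creation = destruction = 0
    for by_year in emp.values():
        change = by_year.get(year, 0) - by_year.get(year - 1, 0)
        if change > 0:
            creation += change
        else:
            destruction += -change
    return creation, destruction, creation - destruction

# Illustrative data only
panel = [
    ("A", 2007, 10), ("A", 2008, 14),  # expanding firm
    ("B", 2007, 20), ("B", 2008, 15),  # contracting firm
    ("C", 2008, 3),                    # entrant
    ("D", 2007, 6),                    # exit
]
flows = gross_job_flows(panel, 2008)
print(flows)
```

The net change alone would hide that creation and destruction are both large, which is the point of publishing the gross flows.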
New collection methods
• Links to registers allow for mass imputation of small samples
• Collection of data at ‘transactions’ site
• New types of info from linking disparate sources
  • Example: linked geographic info for disaster planning.
Uncovering data needs
• Micro-level research reveals useful indicators
  • Employment gross flows (US/BLS)
  • Firm demographics (Eurostat)
• Interactions with external researchers improve understanding of users' needs at NSOs
• Gaps in available data are identified through research
Benefits of using Firm-level data
• Improving quality of statistics
• Testing of theories at firm-level
  • Firm-level data now used in many fields: IO, Trade, Labor, Finance, Management, Organization, Macro
  • Recent improvements in modelling heterogeneous firms
    • Variation in costs (… of learning, transport, etc)
    • Usually representative consumer, constant mark-up
  • Application of econometric techniques (GMM, clever instruments) to cope with endogeneity
• Providing ‘moments’ for modelling
• Policy evaluation
Benefits of using Firm-level data
• Improving quality of statistics
• Testing of theories at firm-level
• Providing ‘moments’ for modelling
  • Information drawn from linked longitudinal firm-level distributions can be used to calibrate models.
  • Especially the ability to do cross-country comparisons is promising
• Policy evaluation
Benefits of using Firm-level data
• Improving quality of statistics
• Testing of theories at firm-level
• Providing ‘moments’ for modelling
• Policy evaluation
  • Individual decision making units respond to policy
    • Track decisions and outcomes from longitudinal micro data
    • No need to infer result from movement in aggregate
  • Identification requires a control group
    • Implementation of policy differs across cells (locations, between types of units, or over time)
    • Effect of policy differs across cells (e.g. highways affect transport-intensive firms)
    • Cross-country comparisons for identification
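One common way to use a control group with longitudinal micro data is a difference-in-differences comparison. The sketch below is an illustration of that idea, not a method named on the slides; the record layout (treated, post, outcome) and all numbers are hypothetical:

```python
from statistics import mean

def diff_in_diff(observations):
    """Difference-in-differences estimate from unit-level observations.

    observations: iterable of (treated, post, outcome) where `treated` marks
    units exposed to the policy and `post` marks the period after it.
    Returns (change for treated units) minus (change for control units).
    """
    cells = {}
    for treated, post, outcome in observations:
        cells.setdefault((treated, post), []).append(outcome)
    m = {cell: mean(values) for cell, values in cells.items()}
    return (m[(True, True)] - m[(True, False)]) - (m[(False, True)] - m[(False, False)])

# Illustrative outcomes for four firms observed before and after a policy
firms = [
    (True, False, 10.0), (True, False, 12.0),   # treated, before
    (True, True, 15.0), (True, True, 17.0),     # treated, after
    (False, False, 8.0), (False, False, 10.0),  # control, before
    (False, True, 10.0), (False, True, 12.0),   # control, after
]
effect = diff_in_diff(firms)
print(effect)
```

The estimate is only credible if treated and control units would have moved in parallel without the policy, which is why the choice of cells matters.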
International Experience
• History of micro data access:
  – Stats Norway: early 1970s
  – US Census: late 1980s
• Typical attitude of NSO before allowing access
  – ‘Micro data is too difficult’, ‘You can’t really do that with data’, ‘We don’t trust you to use the data’, ‘Absolute security is required’
  – ‘Well, maybe we can think of how to allow access….’
• Now: at least 25 NSOs have facilities for micro data research
  – Also, they use the backbone as basis of statistical process: enormous gains in productivity
International Experience
• Situation in EU countries
  – Business Register, VAT register, SS register, Business Surveys
  – Some have on-site, others have remote access: Fin, Swe, Dnk, UK, Nld, Slo, Est
  – Some have excellent in-house research: Fra
• In other countries a variety of situations: ad-hoc sharing of data, on-site, trusted third party
Modes of access to confidential micro data
• Research shop within stats agency
• On-site facility with access rules for external researchers
• Secure remote-access for external researchers
• Remote execution
• Distributed micro data analysis
  • How to share unsharable data
Issues to consider
• Absolute certainty about confidentiality of data
• Uniqueness of published official statistics
• Requirements for access
• Resource cost sharing
Confidentiality
• Must weigh costs and benefits
  – What is ‘cost’ of confidential data being released?
    • Relate to costs of not allowing access to data: increasing irrelevance of stats agency and hopefully extreme budget cuts
  – Don’t just look at technical side of disclosure
    • What is likelihood of malice or fraud?
    • Look at ease of getting same or better confidential data elsewhere
Uniqueness
• The ‘one published number’ view of stats agencies conflicts with reliability
  – We all know numbers don’t add up and that different assumptions generate different stats. So openness, replicability, review and robustness testing by others will enhance the reputation of stats agency publications
• Research output can be labelled as such with a disclaimer
Requirements for access
• Create (legal) framework for allowing access by external researchers
  – Screening of projects and research teams
  – Special employee status
• Create technical facilities
  – Database architecture
  – Meta data
  – On-site laboratory
  – Remote-access facilities
Distributed Micro Data Research
• Distributed Micro Data (DMD) research was developed to allow cross-country research using confidential firm-level data that could not be combined
• The key is to ‘micro-aggregate’ underlying micro data into cells that pass disclosure and
  • Provide enough information for further analysis, and/or
  • Can be merged at cell-level with other sources
• DMD can be viewed as system to allow customer-driven publication of statistics
• ‘Moments’ are useful for economic modelling
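The micro-aggregation step can be sketched as collapsing firm records into cells that carry only counts and moments. This is a minimal illustration under stated assumptions: the record layout, the function name and the minimum cell count of 3 are hypothetical stand-ins for whatever rules a given NSO applies:

```python
from collections import defaultdict

def micro_aggregate(firms, keys, var, min_count=3):
    """Collapse firm records into cells with count, mean and variance of `var`.

    firms: list of dicts (hypothetical layout); keys: fields defining a cell.
    Cells with fewer than `min_count` firms are dropped. The threshold is an
    assumption standing in for an NSO's actual disclosure rule.
    """
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # n, sum, sum of squares
    for firm in firms:
        cell = tuple(firm[k] for k in keys)
        a = acc[cell]
        a[0] += 1
        a[1] += firm[var]
        a[2] += firm[var] ** 2

    table = {}
    for cell, (n, s, ss) in acc.items():
        if n >= min_count:
            m = s / n
            table[cell] = {"n": n, "mean": m, "var": ss / n - m ** 2}
    return table

# Illustrative records only
firms = [
    {"ind": "27", "year": 2008, "prod": 1.0},
    {"ind": "27", "year": 2008, "prod": 2.0},
    {"ind": "27", "year": 2008, "prod": 3.0},
    {"ind": "01", "year": 2008, "prod": 5.0},
    {"ind": "01", "year": 2008, "prod": 7.0},
]
table = micro_aggregate(firms, keys=("ind", "year"), var="prod")
print(table)
```

Only the cell table leaves the NSO. If each country runs the same code with the same cell definitions, the tables can be stacked for cross-country analysis.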
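Data for Cross-country Firm-level Analysis
[Diagram: data sources by type and country coverage]
• Macro and sectoral time series: National Accounts (N.A.) industry data for a single country; EUKLEMS for multiple countries
• Surveys, business registers: longitudinal micro data for a single country (SC LMD); DMD for multiple countries
• EUKLEMS+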
Distributed Micro Data Analysis
[Diagram: workflow between researcher, research network and NSOs]
• Researcher: policy question, research design, program code, publication
• Research network: metadata, network members, DMD tables
• NSOs: provision of metadata, approval of access, execution of code, disclosure analysis of DMD tables, disclosure analysis of publication
DMD Projects
• OECD 2000-2003
• World Bank 2006
  – Followup 2009-2011
• EU/NL 2007
• Eurostat ICT Impacts 2008-2009
  – Followup 2010
Analytical uses of DMD datasets
• Creation of new indicators from linked data
  • Definition of cells based on complex longitudinal characteristics
    • e.g. employer-employee matched
  • ‘Event’ studies (tracking sub-populations based on prior characteristics)
  • Indicators may be higher moments, correlations, regression coefficients, etc.
    • e.g. correlation of profitability and employee gender-ratio, by industry, region and time
• Linking of outside data sources at cell-level
  • Generate custom tabulations of data to match cells of other published or DMD datasets
    • e.g. labor force gender-ratio by region and time
• Cross-country analysis with panels with the same cell level definitions
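Linking at cell-level works because both tables share the same cell definition, so no firm identifiers are needed. A minimal sketch, assuming both sources are dictionaries keyed by (region, year); the variable names and values are hypothetical:

```python
def merge_cells(left, right):
    """Inner-join two cell-level tables that share one cell definition.

    left, right: dicts mapping a cell key, e.g. (region, year), to a dict of
    indicators. Only cells present in both tables are kept.
    """
    return {cell: {**left[cell], **right[cell]} for cell in left.keys() & right.keys()}

# Illustrative values only
dmd_table = {
    ("North", 2008): {"corr_profit_gender_ratio": 0.12},
    ("South", 2008): {"corr_profit_gender_ratio": -0.05},
}
outside_table = {
    ("North", 2008): {"labor_force_gender_ratio": 0.45},
}
merged = merge_cells(dmd_table, outside_table)
print(merged)
```

An inner join is used here so that cells suppressed in either source simply drop out of the merged panel.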
Uses of DMD for Policy Evaluation
• Individual decision making units respond to policy
  • Track decisions and outcomes from longitudinal micro data
  • No need to infer result from movement in aggregate
• Identification requires a control group
  • Implementation of policy differs across cells (locations, between types of units, or over time)
  • Effect of policy differs across cells (e.g. highways affect transport-intensive firms)
Implementing efficient firm-level data analysis
• Technical facilities
• Meta-data libraries
• Disclosure analysis and rules for re-use of extracted datasets
Technical Facilities
• Back-bones for universe of statistical units
  • Firms, Households, Dwellings, etc
• Relational database organisation of data and meta-data
• Statistical tools inside relational database programming environment
• Remote access or remote execution
  • Remote access allows data visualisation, interactive data checking
Meta-data
• Ideal application of meta-data
  – Be able to write generic code remotely
  – Convert code to run locally, using meta-data
• Meta-data set up to describe
  – available datasets
  – unique record identifiers
  – classifications
  – ‘economic variables’
Necessary meta-data
• list of available forms and schedules
• info on record identifiers (Firm_id, person_id)
• info on ‘economic variables’
• info on classifications
• concordances between units
• concordances between variables
• concordances to standard classifications
Underlying Metadata: datasources
Survey Type  Name       Unique keys  Location
BR           GenBusReg  FID, year    G:\dirx
PS           SBS_yyyy   FID          G:\diry
EC           ECS_yyyy   FID          G:\dirz
IS           InvS_yyyy  FID          G:\dirz
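A datasources table like this is what lets generic, remotely written code be converted to run locally. A minimal sketch, assuming the table is held as a dictionary and that "yyyy" in a name is a placeholder for the survey year; the function name is hypothetical, and how a dataset name maps to a file inside the location is left open:

```python
# Rows copied from the datasources table: type -> (name pattern, unique keys, location)
DATASOURCES = {
    "BR": ("GenBusReg", ("FID", "year"), r"G:\dirx"),
    "PS": ("SBS_yyyy", ("FID",), r"G:\diry"),
    "EC": ("ECS_yyyy", ("FID",), r"G:\dirz"),
    "IS": ("InvS_yyyy", ("FID",), r"G:\dirz"),
}

def resolve(survey_type, year):
    """Turn a generic request (survey type, year) into local dataset details.

    Assumes "yyyy" in the name pattern stands for the four-digit year.
    Returns (dataset name, unique keys, location).
    """
    pattern, keys, location = DATASOURCES[survey_type]
    return pattern.replace("yyyy", str(year)), keys, location

name, keys, location = resolve("EC", 1999)
print(name, keys, location)
```

Generic code then refers only to survey types and years, and each NSO supplies its own version of the table.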
Underlying Metadata: variables in survey
ECS_1999
Name    Description             Units       Domain
FID     Unique FirmID           string      GBR
IndC    Detailed industry code  string      ISIC r3
Q1      Use of IT               integer     YNM
PurchS  Software Exp            Eur (1000)
Underlying Metadata: classifications of domains
ISICr3
IndC   Description
TOT    Total Economy
AG     Agriculture, Fishing, Forestry
01     Farms
MFG    Manufacturing
27t35  Durables
27     Basic Metals
Underlying Metadata: Concordances
IndC_ICTind
IndC  ICTind
01    Other
…
12    Other
…
27    27a8
28    27a8
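A concordance of this kind can be applied mechanically to regroup cell-level values. The sketch below uses only the four rows reproduced above; the elided rows are left out rather than guessed, and the function name and input values are hypothetical:

```python
# Only the concordance rows shown above; elided rows are not filled in
INDC_TO_ICTIND = {"01": "Other", "12": "Other", "27": "27a8", "28": "27a8"}

def regroup(totals_by_indc):
    """Sum values reported by IndC code into ICTind groups.

    Raises KeyError for a code missing from the concordance, so that a gap
    in the meta-data is noticed instead of silently dropping data.
    """
    out = {}
    for indc, value in totals_by_indc.items():
        group = INDC_TO_ICTIND[indc]
        out[group] = out.get(group, 0) + value
    return out

# Illustrative values only
regrouped = regroup({"27": 5, "28": 7, "01": 2})
print(regrouped)
```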
Disclosure Analysis
• Can be fairly automated, based on cell-count and ‘concentration’
• Further, rules may be set about further use of a DMD dataset. For example, a requirement that the dataset be erased after use will reduce worries about secondary disclosure.
• Checking may also be required on final publication
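An automated check on cell-count and concentration might look like the sketch below. Both thresholds (at least 3 units per cell, largest unit at most 75% of the cell total) are assumptions chosen for illustration; actual rules differ by NSO:

```python
def passes_disclosure(values, min_count=3, max_top_share=0.75):
    """Check one cell against a cell-count rule and a concentration rule.

    values: contributions of the individual units in the cell.
    min_count and max_top_share are hypothetical thresholds.
    """
    if len(values) < min_count:
        return False  # too few units in the cell
    total = sum(values)
    if total <= 0:
        return False  # concentration undefined; fail conservatively
    return max(values) / total <= max_top_share

print(passes_disclosure([10, 12, 11]))  # enough units, no dominant unit
print(passes_disclosure([100, 1, 1]))   # one unit dominates the cell
print(passes_disclosure([5, 5]))        # too few units
```

The same function can be run over every cell of a DMD table before it leaves the NSO, and again on the tables in a final publication.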