© Federal Statistical Office, Research Data Centre, Maurice Brandt Folie 1 Analytical validity and...


Slide 1 © Federal Statistical Office, Research Data Centre, Maurice Brandt

Analytical validity and confidentiality protection of anonymised longitudinal enterprise microdata –

Survey of a German Project

Maurice Brandt (1), Michael Konold (2), Rainer Lenz (3) and Martin Rosemann (4)

Research Data Centres of the Federal Statistical Office (1) and the Statistical Offices of the Länder (2),

University of Applied Sciences Mainz (3),

Institute for Applied Economic Research (4)

Work session on statistical data confidentiality, Manchester, 17-19 December 2007

Slide 2

Overview

1. Introduction

2. The data sets of the project

3. Anonymisation methods and analytical validity

4. Approaches to assessing anonymity

5. Conclusions

Slide 3

1. Introduction

“Business Panel data and de facto anonymisation”: a new project running since the beginning of 2006

improve the data infrastructure in Germany regarding longitudinal data on local units and enterprises

guarantee the access of the scientific community to the panel data of economic statistics

the former project “De facto anonymisation of business microdata” showed that de facto anonymisation can be achieved on a cross-section basis

Slide 4

1. Introduction

In this project, different business statistics are linked to form longitudinal data sets

it is planned to complement the data with information from the official business register

the data sets can already be used for scientific work

the final aim is to produce a scientific use file

Slide 5

2.1 The data sets of the project

Units of analysis are the local units in manufacturing and mining

Complete enumeration of local units with 20 or more employees

Monthly reports: years 1995 to 2005; information about employees, wages, salaries, turnover

Survey of investments: years 1995 to 2005; information on widely differing types of investments

Survey of small units: years 1995 to 2002; local units with 19 or fewer employees

Slide 6

2.2 The data sets of the project

Cost Structure Survey

Stratified sample of enterprises with 20 or more employees in the manufacturing and mining sector

Years 1995 to 2005: more than 43,000 enterprises altogether

Information on output, production factors, employees

From 1999 to 2002: 13,300 enterprises available over the whole period

Studies regarding investments in research and development are possible

Slide 7

2.3 The data sets of the project

Turnover Tax Statistics

Very large data set with a total of 4.3 million enterprises, years 2000 to 2004 (1.8 million over the whole period)

Information on all taxable turnovers, turnover tax, prior tax and tax liability

IAB Panel of local units

Information on employment trend, staff structure, hours worked, turnover, export share, investments and innovation

Various waves since 1993, covering about 4,300 to a maximum of 16,000 local units

Slide 8

3. Anonymisation methods and analytical validity

Anonymisation methods

Methods reducing the information (suppression of variables or presenting key variables in broader categories)

Methods modifying the values of numerical data (data-perturbing methods)

Data-perturbing methods for panel data

Microaggregation:
(a) separately for all variables and all periods (individual ranking),
(b) separately for all variables but jointly for all periods,
(c) separately for all periods but jointly for all variables and
(d) jointly for all periods and all variables

Multiplicative stochastic noise: mixture distribution (approach of Höhne)

Multiple imputation
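To make two of the perturbation methods listed above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the group size `k`, the noise parameters `delta` and `sigma`, the equal mixture weights and all function names are hypothetical, and the slides do not say whether the mixture component is drawn once per unit or once per value (this sketch draws it independently for every value).

```python
import random
from statistics import mean


def microaggregate_individual_ranking(values, k=3):
    """Replace each value by the mean of its group of k rank neighbours.

    Applied to one variable in one period at a time, which corresponds
    to variant (a), individual ranking.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Split the rank order into consecutive groups of size k; a short
    # remainder is merged into the previous group so that no group has
    # fewer than k members.
    groups = [order[s:s + k] for s in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    result = list(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
    return result


def mixture_noise_factor(rng, delta=0.1, sigma=0.02):
    """Draw a factor from an equal-weight mixture of two normal
    distributions centred at 1 - delta and 1 + delta (assumed values)."""
    centre = 1 + delta if rng.random() < 0.5 else 1 - delta
    return rng.gauss(centre, sigma)


def perturb_panel(panel, rng, delta=0.1, sigma=0.02):
    """Multiply every value by an independently drawn mixture factor.

    `panel` maps a unit id to its list of values over the periods.
    """
    return {
        unit: [v * mixture_noise_factor(rng, delta, sigma) for v in series]
        for unit, series in panel.items()
    }


# Hypothetical turnover values of seven local units in one year.
turnover = [12.0, 250.0, 14.0, 13.0, 240.0, 260.0, 500.0]
print(microaggregate_individual_ranking(turnover, k=3))
print(perturb_panel({"unit_1": [100.0, 110.0, 120.0]}, random.Random(42)))
```

Microaggregation keeps the total of each variable unchanged, because every group is replaced by its own mean. The mixture noise keeps the factors away from 1, so that no value is left almost unchanged.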

Slide 9

3. Anonymisation methods and analytical validity

In focus

Impacts of data-perturbing methods on
descriptive distribution measures
the estimation of econometric panel models, particularly the within estimator used to control for unobservable individual heterogeneity

First results

The within estimator is consistent in the case of anonymisation by individual ranking

The project team derived consistent within estimators for the case of anonymisation by multiplicative stochastic noise (including the method of Höhne) and no autocorrelation

Case of autocorrelation: work in progress

Multiple imputation: separate talk at this conference
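For readers less familiar with the within estimator, the following Python sketch shows the textbook fixed-effects computation for a single regressor: each unit's observations are demeaned over time, which removes the unit-specific constant, and the slope is then estimated from the demeaned data. The function name and the toy panel are hypothetical. The sketch does not reproduce the corrected estimators derived by the project team for noise-perturbed data.

```python
from statistics import mean


def within_estimator(panel):
    """Fixed-effects (within) slope estimate for a single regressor.

    `panel` maps a unit id to a list of (x, y) observations over time.
    """
    numerator = 0.0
    denominator = 0.0
    for observations in panel.values():
        x_bar = mean(x for x, _ in observations)
        y_bar = mean(y for _, y in observations)
        for x, y in observations:
            # Demeaning removes the unit-specific constant effect.
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2
    return numerator / denominator


# Hypothetical toy panel: y = alpha_i + 2 * x with alpha_A = 10, alpha_B = -5.
toy_panel = {
    "A": [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0)],
    "B": [(4.0, 3.0), (6.0, 7.0), (8.0, 11.0)],
}
print(within_estimator(toy_panel))  # 2.0
```

Running the same function on the original and on the anonymised panel gives a simple first check of how far a perturbation method shifts the estimate.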

Slide 10

4. Approaches to assessing anonymity

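Given external data $\{a_1, \dots, a_n\}$ and target data $\{b_1, \dots, b_n\}$, we calculate coefficients $d(a_i, b_j)$ and obtain:

$$
\text{(AP)} \quad \text{Minimize} \quad \sum_{i=1}^{n} \sum_{j=1}^{n} d(a_i, b_j)\, x_{ij}
$$

$$
\text{s.t.} \quad x_{ij} \in \{0, 1\} \quad \text{for } i, j = 1, \dots, n,
$$

$$
\sum_{j=1}^{n} x_{ij} = 1 \quad \text{for } i = 1, \dots, n
$$

and

$$
\sum_{i=1}^{n} x_{ij} = 1 \quad \text{for } j = 1, \dots, n.
$$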
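The assignment problem (AP) can be illustrated with a very small Python sketch. It assumes a ready-made coefficient matrix and simply enumerates all one-to-one matchings, which is exact but only feasible for a handful of records. The function name and the numbers are hypothetical, and the slides do not say which solver the project uses.

```python
from itertools import permutations


def solve_assignment(cost):
    """Solve (AP) exactly by enumerating all permutations.

    cost[i][j] plays the role of d(a_i, b_j). Returns (total, matching),
    where matching[i] is the index j of the target record assigned to
    external record i. Only feasible for small n (n! candidates).
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical coefficients for three external and three target records.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, matching = solve_assignment(cost)
print(total, matching)  # 5 [1, 0, 2]
```

Each permutation satisfies both constraint sets of (AP), since every external record receives exactly one target record and vice versa. For realistic file sizes a polynomial-time method such as the Hungarian algorithm would be needed instead of enumeration.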

Slide 11

4. Approaches to assessing anonymity

Four approaches are used to estimate the coefficients of the linear program (AP):

Conventional distance-based approach

Correlation-based approach

Distribution-based approach

Collinearity-based approach
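As an illustration of the first approach only, the sketch below computes coefficients for (AP) as standardised Euclidean distances between the records of the two files. The slides do not give the exact distance formula, so the standardisation by the pooled standard deviation, the function name and the sample records are all assumptions.

```python
from math import sqrt
from statistics import pstdev


def distance_coefficients(external, target):
    """Return the matrix d[i][j] of standardised Euclidean distances.

    `external` and `target` are lists of records; each record is a list
    of numeric values for the same key variables in the same order.
    """
    n_vars = len(external[0])
    # Scale every variable by its standard deviation over both files so
    # that variables with large values (e.g. turnover) do not dominate.
    scales = []
    for v in range(n_vars):
        pooled = [rec[v] for rec in external] + [rec[v] for rec in target]
        scales.append(pstdev(pooled) or 1.0)
    return [
        [
            sqrt(sum(((a[v] - b[v]) / scales[v]) ** 2 for v in range(n_vars)))
            for b in target
        ]
        for a in external
    ]


# Hypothetical records: (turnover, employees).
external = [[100.0, 20.0], [5000.0, 300.0]]
target = [[5100.0, 310.0], [98.0, 21.0]]  # anonymised, in shuffled order
d = distance_coefficients(external, target)
print(d)
```

In this toy example the smallest coefficient in each row points to the perturbed counterpart of the external record, which is the situation an anonymisation method tries to prevent.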

Slide 12

5. Conclusions

Within the scope of the project, the panel data sets can be used via remote data processing and at safe scientific workstations in the statistical offices

They are already used in some research projects

The first scientific use files, for use on one's own workstation, will probably be available at the beginning of 2009

Slide 13

Thank you for your attention