Statistical business registers: IT considerations

V. Todorov
United Nations Industrial Development Organization (UNIDO), Vienna

Regional Workshop on the Statistical Business registers for the Arab Countries
26-29 September 2016, Amman, Jordan


Outline

1 Introduction

2 General considerations

3 Database Management System (DBMS)

4 Programming requirements

5 Software for record linkage

6 Data retention


Introduction

IT infrastructure and programming requirements for the build phase of an SBR system

• When establishing any system, there are many possible technologies.
• This applies to an SBR as well.
• The choice should take into account
   - scalability,
   - cost and
   - maintenance.
• The technology should be flexible enough to evolve with new requirements.


General considerations

• There is no international standard, or even commonly used practice, amongst NSIs regarding the design of an SBR system per se.

• The main consideration is to develop an SBR system that
   - fits within the NSI IT architecture and
   - is as compatible as possible with other systems, such as the administrative data acquisition systems and the business survey collection systems.


General considerations

• Project management methodology

• Software development methodology

• Solution architecture: functional and non-functional requirements; layered architecture

• Database: relational database management system (RDBMS)

• Frame, collection and respondent burden modules

• Documentation


General considerations: Layered architecture

Simple layered architecture

Presentation layer: Implements the user interface and manages user interaction with the system.

Service layer: Exposes interfaces and system functionality to other systems, and may also be the boundary between the presentation and business layers.

Business layer: Implements the core functionality and business logic.

Data access layer: Implements access to and interaction with data stores.
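
The four layers described above can be sketched in a few lines of Python. This is only an illustration of how the responsibilities might be separated, not a recommended SBR design: the class names, the in-memory store and the request format are all assumptions made for the example.

```python
class UnitRepository:
    """Data access layer: the only place that touches the data store."""

    def __init__(self):
        self._units = {}  # hypothetical in-memory stand-in for an RDBMS

    def save(self, unit_id, record):
        self._units[unit_id] = dict(record)

    def get(self, unit_id):
        return self._units.get(unit_id)


class RegisterLogic:
    """Business layer: core rules, independent of storage and user interface."""

    def __init__(self, repository):
        self._repo = repository

    def register_unit(self, unit_id, name):
        name = name.strip()
        if not name:
            raise ValueError("a statistical unit needs a name")
        if self._repo.get(unit_id) is not None:
            raise ValueError(f"identifier {unit_id} is already in use")
        self._repo.save(unit_id, {"id": unit_id, "name": name})
        return self._repo.get(unit_id)


class RegisterService:
    """Service layer: the interface other systems (or the GUI) call."""

    def __init__(self, logic):
        self._logic = logic

    def handle(self, request):
        try:
            unit = self._logic.register_unit(request["id"], request["name"])
            return {"status": "ok", "unit": unit}
        except (KeyError, ValueError) as exc:
            return {"status": "error", "message": str(exc)}


def render(response):
    """Presentation layer: turns a service response into text for the user."""
    if response["status"] == "ok":
        unit = response["unit"]
        return f"Registered {unit['name']} as {unit['id']}"
    return f"Could not register unit: {response['message']}"


service = RegisterService(RegisterLogic(UnitRepository()))
print(render(service.handle({"id": "SU1", "name": " Amman Trading Co "})))
print(render(service.handle({"id": "SU1", "name": "Another name"})))
```

Because each layer only talks to the one below it, the in-memory repository could later be replaced by a real database without changing the business or presentation code.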


Database Management System (DBMS) options

• Database Management System (DBMS)
   - Oracle or MS SQL Server
   - MySQL
   - SQLite
   - MS Access
   - Other, more exotic ones: MaxDB


Programming requirements

• Graphical user interface (GUI)

• Programming language
   - .NET and C#
   - Java
   - Python
   - R, SAS


Programming requirements

• IT environments: a full-scale solution may involve five distinct environments
   1. Production environment
   2. Practice (training) environment
   3. User acceptance environment
   4. Development environment
   5. Analysis environment

• The first and fifth environments are essential; all forms of testing and analysis can take place in environment 5.


Establishing a unique identifier for statistical units

• Essential for accurate maintenance of the SBR.

• Sequential assignment of unique identifiers, managed centrally.

• Reduces the risk of inadvertently disclosing confidential micro-data, a risk that arises when easily identifiable and recognizable information is used as the identifier.


Some of the key elements to consider when creating a unique identifier for the SBR:

• Create an identification numbering system for each statistical unit, no matter what type.

• Use a non-confidential identifier in order to facilitate the statistical processing.

• Ensure these unique identifiers have no meaning.

• Ensure that the unique identifiers cannot be reused.
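
A minimal sketch of how these elements might fit together is shown below. It assumes a single central counter shared by all unit types, so the number itself carries no meaning, and it marks identifiers as retired rather than freeing them, so they are never handed out again. The class and method names are hypothetical.

```python
class IdentifierAllocator:
    """Hands out sequential identifiers from a single central counter."""

    def __init__(self, first_number=1):
        self._next_number = first_number
        self._status = {}  # identifier -> "active" or "retired"

    def allocate(self):
        # The same series serves every unit type, so the number says
        # nothing about the unit it identifies.
        identifier = self._next_number
        self._next_number += 1
        self._status[identifier] = "active"
        return identifier

    def retire(self, identifier):
        # Retiring keeps the identifier on record; the counter never goes
        # back, so a retired number cannot be assigned to another unit.
        if identifier not in self._status:
            raise KeyError(identifier)
        self._status[identifier] = "retired"

    def status(self, identifier):
        return self._status[identifier]


allocator = IdentifierAllocator()
enterprise_id = allocator.allocate()
establishment_id = allocator.allocate()
allocator.retire(enterprise_id)
replacement_id = allocator.allocate()  # gets a fresh number, not the retired one
```

In a production system the counter would normally live in the database (for example as a sequence) so that all applications draw from the same central source.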


Technical considerations in the generation of a unique identifier are as follows:

• Include a check digit function (an algorithm for calculating a check digit according to Modulo 11 is included in Annex E2).

• A key generator function can be used.

• Preferably use alphabetic characters combined with numbers (to avoid confusion with any other numeric data).
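
The sketch below illustrates these considerations. Annex E2 is not reproduced here, so the check digit uses one common Modulo 11 variant (weights 2, 3, 4, ... applied from the rightmost digit, as in ISBN-10), which may differ from the algorithm in the annex. The prefix "SU", the nine-digit width and the function names are assumptions made for the example.

```python
PREFIX = "SU"    # hypothetical fixed alphabetic prefix, the same for all unit types
SEQ_WIDTH = 9    # hypothetical number of digits in the sequential part


def mod11_check_digit(digits):
    """Return a Modulo 11 check character for a string of ASCII digits.

    Assumed variant: weights 2, 3, 4, ... from the rightmost digit;
    a check value of 10 is written as "X".
    """
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("digits must be a non-empty string of 0-9")
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=2))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def make_identifier(sequence_number):
    """Build prefix + zero-padded sequence number + check character."""
    if sequence_number < 0 or len(str(sequence_number)) > SEQ_WIDTH:
        raise ValueError("sequence number out of range")
    body = str(sequence_number).zfill(SEQ_WIDTH)
    return PREFIX + body + mod11_check_digit(body)


def is_valid(identifier):
    """Check the prefix, the length and the check character."""
    if not identifier.startswith(PREFIX):
        return False
    body, check = identifier[len(PREFIX):-1], identifier[-1]
    if len(body) != SEQ_WIDTH or not (body.isascii() and body.isdigit()):
        return False
    return mod11_check_digit(body) == check


identifier = make_identifier(123)
print(identifier, is_valid(identifier))
```

A check character of this kind catches single-digit typing errors and swaps of adjacent digits when identifiers are keyed in manually.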


Software for record linkage

• According to the Guidelines: the results quoted for the performance of automated data matching tools and software tend to be overly optimistic.

• Deterministic matching or probabilistic record linkage often yields:
   - mismatches when the matching rules are too loose and
   - a high percentage of missed matches when the rules are too rigid.
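
The trade-off between loose and rigid rules can be seen in a toy example. The sketch below is a simple rule-based matcher that scores name pairs by word overlap (Jaccard similarity); it is a stand-in chosen for brevity, not the method of any particular linkage tool, and the business names are invented. In this made-up data the true pairs are assumed to be a1/b1 and a2/b2.

```python
def tokens(name):
    """Lower-case a business name and split it into a set of words."""
    return set(name.lower().split())


def similarity(name_a, name_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(name_a), tokens(name_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link(register, source, threshold):
    """Return every (register id, source id) pair scoring at least threshold."""
    return [(id_a, id_b)
            for id_a, name_a in register.items()
            for id_b, name_b in source.items()
            if similarity(name_a, name_b) >= threshold]


# Invented data for illustration only.
register = {"a1": "Amman Trading Co",
            "a2": "Jordan Steel Industries"}
source = {"b1": "Amman Trading Company Ltd",
          "b2": "Jordan Steel Industries",
          "b3": "Amman Steel Co"}

rigid = link(register, source, threshold=1.0)   # exact word sets only
loose = link(register, source, threshold=0.35)  # tolerant of differences

print("rigid:", rigid)
print("loose:", loose)
```

With the rigid rule the pair a1/b1 is missed because of the "Co" versus "Company Ltd" difference; with the loose rule it is found, but the unrelated pair a1/b3 is accepted as well. Results of this kind are why matched pairs usually need clerical review.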


• Commercial software
   - Most of them are “black boxes” from the users’ perspective (the source code of their linkage engines is not available for inspection).
   - Specialized to a certain domain, e.g. de-duplication of customer mailing lists.
   - Only the smaller systems are affordable, and these have limited functionality, are limited in their ability to process different data types, and can process only small data sets.


• Open source and free software
   - Allow access to the source code of their linkage engines.
   - Free of charge, although this does not always mean “no costs”.
   - Flexible and extensible.
   - Include a large number of linkage techniques.
   - Allow the practitioner to experiment with traditional as well as advanced linkage techniques; the user is able to understand, up to a certain degree, many technical details.


• RELAIS
   - ISTAT
   - Implemented in Java and R (both languages are open source and can be used on different platforms)
   - Graphical User Interface (GUI) available, written in Java
   - Input and output data in a relational database, MySQL, which is also an open source product
   - Available for all major platforms
   - https://joinup.ec.europa.eu/software/relais/description


• FEBRL
   - Freely Extensible Biomedical Record Linkage (FEBRL) System
   - The Australian National University, Canberra
   - Implemented in Python (free object-oriented programming language)
   - Graphical User Interface (GUI) available
   - Input from text files (CSV), SQL in the future
   - Available for all major platforms
   - Contains many recently developed record linkage techniques
   - http://sourceforge.net/projects/febrl/


• RecordLinkage R package
   - An R package available from CRAN
   - Machine learning methods are utilized
   - Decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm).
   - http://cran.r-project.org/web/packages/RecordLinkage/index.html
   - http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf


Data retention

• The SBR data retention strategy should be articulated in accordance with operational and analytical needs.

• It should begin by determining how changes made to the SBR will be tracked and what historical information will need to be kept.

• Tracking changes

• Frequency and content of snapshots

• Administrative updates
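
One possible way to track changes is a history table that records the old and new value of every modified field together with the source of the change. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table layout, the column names and the "source" labels are assumptions made for illustration, not a prescribed SBR schema.

```python
import sqlite3

TRACKED_FIELDS = ("name", "status")  # hypothetical set of tracked columns


def open_register():
    """Create an in-memory register with a unit table and a history table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE unit (
            unit_id TEXT PRIMARY KEY,
            name    TEXT NOT NULL,
            status  TEXT NOT NULL
        );
        CREATE TABLE unit_history (
            change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            unit_id    TEXT NOT NULL REFERENCES unit(unit_id),
            field      TEXT NOT NULL,
            old_value  TEXT,
            new_value  TEXT,
            source     TEXT NOT NULL,
            changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn


def update_field(conn, unit_id, field, new_value, source):
    """Update one field and log the change; return False if nothing changed."""
    if field not in TRACKED_FIELDS:  # also guards the column name used in SQL
        raise ValueError(f"untracked field: {field}")
    row = conn.execute(
        f"SELECT {field} FROM unit WHERE unit_id = ?", (unit_id,)
    ).fetchone()
    if row is None:
        raise KeyError(unit_id)
    old_value = row[0]
    if old_value == new_value:
        return False
    with conn:  # update and history row are committed together
        conn.execute(
            f"UPDATE unit SET {field} = ? WHERE unit_id = ?",
            (new_value, unit_id),
        )
        conn.execute(
            "INSERT INTO unit_history "
            "(unit_id, field, old_value, new_value, source) "
            "VALUES (?, ?, ?, ?, ?)",
            (unit_id, field, old_value, new_value, source),
        )
    return True


conn = open_register()
conn.execute("INSERT INTO unit VALUES ('SU1', 'Amman Trading Co', 'active')")
update_field(conn, "SU1", "name", "Amman Trading Company",
             source="administrative update")
history = conn.execute(
    "SELECT field, old_value, new_value, source FROM unit_history"
).fetchall()
print(history)
```

Recording the source of each change makes it possible to tell administrative updates apart from manual corrections when the history is analysed later.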


Example from Statistics Canada

A complete copy (snapshot) of the SBR database (live register) is taken just prior to the first day of every month. A generalized survey universe file (GSUF), i.e. frozen frame, containing every statistical unit, is created from the snapshot every month. Although frozen frames are primarily used for sampling, normally soon after their creation, they are retained for an extended period for analysis purposes.

Monthly frozen frames    Retention period
January                  Indefinite
February to December     24 months
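
The retention rule in the table can be expressed as a small function. This is a sketch under one assumption that the source does not spell out: a frame is treated as retained up to and including its 24th month of age, counted in whole calendar months. The function names are hypothetical.

```python
from datetime import date

RETENTION_MONTHS = 24  # retention period for February to December frames


def months_between(earlier, later):
    """Whole calendar months from one date to another (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def is_retained(frame_month, today):
    """Apply the rule: January frames are kept indefinitely, others 24 months."""
    if frame_month.month == 1:
        return True
    # Assumption: a frame exactly 24 months old is still kept.
    return months_between(frame_month, today) <= RETENTION_MONTHS


today = date(2016, 9, 1)
for frame in (date(2010, 1, 1), date(2014, 8, 1), date(2014, 9, 1)):
    print(frame.strftime("%Y-%m"), "keep" if is_retained(frame, today) else "purge")
```

Keeping every January frame gives one permanent reference point per year, while the other monthly frames are only needed for a limited period.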
