S-DWH Architecture (Recap): Summary: A Data Warehouse as statistical production system. S-DWH Architectural Domains. S-DWH layered architecture. National Institute of Statistics – Italy, Antonio Laureti Palma - IT Business Statistics. 4th ESS-Net Workshop on “Micro data linking and warehousing in statistical production”, Tallinn, Statistics Estonia, March 20th and 21st, 2013


S-DWH Architecture (Recap):

Summary:

- A Data Warehouse as statistical production system

- S-DWH Architectural Domains

- S-DWH layered architecture

National Institute of Statistics – Italy, Antonio Laureti Palma - IT Business Statistics

4th ESS-Net Workshop on “Micro data linking and warehousing in statistical production”, Tallinn, Statistics Estonia, March 20th and 21st, 2013


Data Warehouse

In an enterprise a data warehouse (DWH) is an organized information collection which stores current and historical data used for creating reports for management.

Reports are generally produced from already structured information or continual data mining carried out by experts.

Data mining is carried out using advanced statistical methods that correlate “primary information” from different production departments.

The delivery of reports is carried out using “secondary information” stored in specialized Data Marts.

The discovery of useful information aids business strategies which should increase efficiency in the production process.
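As a rough illustration of the distinction between primary and secondary information, the sketch below aggregates a few cleaned department-level records into a small summary table of the kind a Data Mart might hold. The record fields, department values and figures are invented for illustration and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical primary information: cleaned records from production departments.
primary = [
    {"department": "SALES", "year": 2012, "amount": 120.0},
    {"department": "SALES", "year": 2012, "amount": 80.0},
    {"department": "PRODUCTION", "year": 2012, "amount": 50.0},
    {"department": "SALES", "year": 2013, "amount": 100.0},
]


def build_data_mart(records):
    """Derive secondary information: totals per (department, year)."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["department"], rec["year"])] += rec["amount"]
    return dict(totals)


data_mart = build_data_mart(primary)
```

Reports would then read from `data_mart` rather than from the individual primary records.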

[Figure: S-DWH overview architecture. Labels: PRODUCTION, SALES, RESOURCES, DISTRIBUTION, EXTERNAL, OPERATIONAL; operational information; ETL; Staging Area, Data Integration, Data Warehouse, Data Mart; primary information; secondary information; DATA MINING; REPORTS; DECISION; feedback.]

Statistical Data Warehouse

In an enterprise, the information collected in a DWH derives from different production departments. Producing primary information requires specific Extraction, Transformation and Loading (ETL) procedures.

In an S-DWH (Statistical DWH) the sources are direct data capturing or administrative archives, and the ETL procedures must manage the actual statistical elaboration.

An S-DWH is a coherent collection of current and historical data covering different statistical topics and domains.

A high level of coordination is necessary within the different topics, within the different operational phase activities, and between topics and activities.

S-DWH Architectural domains

Defining an S-DWH and enabling its evolution requires a framework in which key principles and models can be created, communicated and improved.

An S-DWH framework comprises:

- business domain, to align strategic objectives and tactical demands through a common understanding of the organization;

- information architecture domain, to describe database organization and the management of data and metadata;

- technology domain, i.e. the combined set of software, hardware and networks to develop and support IT services.

S-DWH business domain

In the business domain, the key business processes to consider are:

- adapting or redesigning statistical regulation (timing and output);

- managing the change from stove-pipe approaches to a coherent S-DWH;

- redesigning operational management (methodology and processes);

- data management (security, custodianship and ownership, data and metadata);

- quality management (assessment and control);

- software and IT infrastructure management.

S-DWH information domain

To organize an S-DWH information architecture, we group functionalities into four distinct layers:

- the access layer, designed for a wide range of users or IT tools, for the final presentation of the information sought;

- the interpretation and data analysis layer, designed for interactive, non-structured human activities;

- the integration layer, designed for the ETL functions, which should be carried out automatically or semi-automatically;

- the source layer, designed for storing and managing internal (surveys) or external (archives) raw data sources.

[Figure: S-DWH information domain overview. Layers: Source Layer, Integration Layer, Interpretation and Analysis Layer, Access Layer. Other labels: CATI, CAWI, CAPI, ADMIN, OPERATIONAL; operational information; ETL; Staging Area, Data Integration, Data Warehouse, Data Mart; primary information; secondary information; DATA MINING; REPORTS; DECISION.]

S-DWH information domain

ESS-Net S-DWH functional definition:

A statistical data warehouse is a central statistical data store, regardless of the data’s source, for managing all available data of interest, improving the ability of the NSI to:

- (re)use data to create new data/new outputs;

- perform reporting;

- execute analysis;

- produce the necessary information.

[Figure: S-DWH information domain. The SOURCES, INTEGRATION, INTERPRETATION AND ANALYSIS and ACCESS layers, annotated with the functions of the ESS-Net definition: produce the necessary information; re-use data to create new data; new outputs; perform reporting; execute analysis.]

S-DWH technology domain

In the technology domain, the layered architecture reflects a conceptual organization in which the first two layers (sources and integration) are considered purely statistical operational infrastructure, used for acquiring, storing, editing and validating data, and the last two layers (interpretation and analysis, access) are the actual data warehouse, i.e. the levels at which data are accessible for analysis.

These reflect two different IT environments: an operational one, which supports semi-automatic computer interaction systems, and an analytical one, the warehouse, where free human interaction is maximized.

[Figure: the STATISTICAL DATA WAREHOUSE stack. SOURCES LAYER and INTEGRATION LAYER are labelled OPERATIONAL DATA; INTERPRETATION AND ANALYSIS LAYER and ACCESS LAYER are labelled DATA WAREHOUSE.]
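The split between the operational environment and the warehouse can be written down as a small data structure. This is only a sketch: the class and member names (`Layer`, `Environment`) are illustrative choices, not part of the S-DWH specification.

```python
from enum import Enum


class Environment(Enum):
    OPERATIONAL = "operational"  # semi-automatic, system-driven processing
    WAREHOUSE = "warehouse"      # analytical, human-driven interaction


class Layer(Enum):
    # Ordered from data acquisition to final presentation.
    SOURCE = 1
    INTEGRATION = 2
    INTERPRETATION_AND_ANALYSIS = 3
    ACCESS = 4

    @property
    def environment(self) -> Environment:
        # First two layers: operational infrastructure; last two: the warehouse.
        if self.value <= 2:
            return Environment.OPERATIONAL
        return Environment.WAREHOUSE


warehouse_layers = [
    layer.name for layer in Layer if layer.environment is Environment.WAREHOUSE
]
```

Here `warehouse_layers` collects the two layers in which data are open to analysis.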

S-DWH technology domain

In the technology domain each layer (ACCESS LAYER, INTERPRETATION AND ANALYSIS LAYER, INTEGRATION LAYER, SOURCE LAYER) must support different process typologies:

- ROLAP (Relational Online Analytical Processing) uses specific analytical tools on a relational dimensional data model, which is easy to understand and does not require pre-computation and storage of the information.

- MOLAP (Multidimensional Online Analytical Processing) uses specific analytical tools on a multidimensional data model.

- Data mapping involves combining data residing in different sources and providing users with a unified view of these data.

- OLTP (OnLine Transaction Processing) is the typical operational activity for data editing.
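The difference between the two OLAP styles can be shown with a toy example. The sketch below uses Python's standard `sqlite3` module; the table names (`dim_region`, `fact_turnover`), columns and values are invented for illustration. The ROLAP-style function computes an aggregate at query time from relational tables, while the MOLAP-style dictionary stores every cell of a small cube in advance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_turnover (region_id INTEGER, year INTEGER, turnover REAL);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_turnover VALUES
        (1, 2012, 10.0), (1, 2012, 5.0), (2, 2012, 7.0), (2, 2013, 3.0);
""")


def rolap_total(region, year):
    """ROLAP style: aggregate at query time, nothing is pre-computed."""
    row = conn.execute(
        """SELECT SUM(f.turnover)
           FROM fact_turnover f JOIN dim_region d USING (region_id)
           WHERE d.name = ? AND f.year = ?""",
        (region, year),
    ).fetchone()
    return row[0]


# MOLAP style: every (region, year) cell is computed once and stored.
cube = {
    (name, year): total
    for name, year, total in conn.execute(
        """SELECT d.name, f.year, SUM(f.turnover)
           FROM fact_turnover f JOIN dim_region d USING (region_id)
           GROUP BY d.name, f.year"""
    )
}
```

Both `rolap_total("North", 2012)` and `cube[("North", 2012)]` return the same total; the trade-off is between query-time computation and storage of pre-computed cells.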

Statistical Data Warehouse layered architecture: Access Layer

The Access Layer is the layer for the final presentation, dissemination and delivery of the information sought.

This layer is designed for a wide range of users and computer instruments.

In this layer the data organization must support automatic dissemination systems and Business Intelligence (BI) tools. In both cases, statistical information is structured in data models at micro and macro data levels.

DWA - Access Layer

Specialized functionalities:

- SDMX interface (Statistical Data and Metadata eXchange): supports cross-border service interoperability for public administrations or organizations.

- Specialized BI tools: an extensive category of products on the market for query building, data navigation, web browsing, graphics and publishing.

- Office automation tools: a reassuring solution for users who are new to the data warehouse context, as they are not forced to learn complex new instruments. The problem is that this solution, while adequate with regard to productivity and efficiency, is very restrictive in the use of the data warehouse, since these instruments have significant architectural and functional limitations.

DWA - Access Layer

From the GSBPM we consider the following sub-processes of phase 7, Disseminate:

- 7.1 “update output systems”, including re-formatting data and metadata into specific output databases;

- 7.2 “produce dissemination”, a sort of integration process between tables, text and graphs;

- 7.3 “manage release of dissemination products”, which ensures that all elements are in place for the release;

- 7.4 “promote dissemination”, which includes wikis, blogs and customer relationship management tools;

- 7.5 “manage user support”, which ensures that customer queries are recorded and answered within agreed deadlines.

Statistical Data Warehouse layered architecture: Interpretation and data analysis layer

The Interpretation and data analysis layer is specifically for statisticians and enables data mining. This is the actual data warehouse, and it must support all kinds of statistical analysis on micro and macro data.

Data evaluation in this layer supports the design of any new production processes or data re-use.

The results expected of the human activities in this layer should then be statistical “services” useful for other phases of the elaboration process, from sampling, to the set-up of the instruments used in the Integration Layer, to the generation of possible new statistical outputs.

Activities in the Interpretation layer improve the S-DWH capabilities.

DWA - Interpretation and data analysis layer

From the GSBPM we consider the following sub-processes:

- 1 Specify Needs: 1.5 check data availability

- 2 Design: 2.1 design outputs; 2.2 design variable descriptions; 2.4 design frame and sample; 2.5 design statistical methodology; 2.6 design production systems

- 4 Collect: 4.1 select sample

- 5 Process: 5.1 integrate data; 5.5 derive new variables and units; 5.6 calculate weights; 5.7 calculate aggregate

- 6 Analyze: 6.1 prepare draft output; 6.2 validate outputs; 6.3 scrutinize and explain; 6.4 apply disclosure control; 6.5 finalize outputs

- 7 Disseminate: 7.1 update output systems

- 9 Evaluate: 9.1 gather evaluation inputs; 9.2 conduct evaluation

Statistical Data Warehouse layered architecture: Integration layer

The integration layer is where all operational activities needed for all statistical elaboration processes are carried out.

This means operations carried out automatically or manually by operators to produce statistical information in a common IT infrastructure. These are recurring (cyclic) activities involved in the running of the whole or any part of a statistical production process.

Statistical elaboration should be organized in operational workflows for checking, cleaning, linking and harmonizing data in a common persistent area (Data Vault) where information is grouped by subject.

DWA - Integration layer

Specialized functions:

This layer is used for all integration and reconciliation activities of data sources. It contains the set of applications that perform the main ETL, which manages:

- inconsistent coding for the same object; consistency is obtained through the coding defined by the data warehouse;

- adjustment of different units of measurement and inconsistent formats;

- alignment of inconsistent labels (the same object named differently); usually the data are identified according to the definition contained in the metadata of the system;

- incomplete or incorrect data; in this case the operation may require human intervention to resolve issues that cannot be predicted a priori;

- data linking, in which different sources enable the creation of extended, or new, units of analysis.
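The ETL tasks listed above can be sketched in a few lines. Everything in this example is hypothetical: the code list `CODE_MAP`, the conversion table `TO_EURO`, the field names and the sample records are assumptions made for illustration, not part of any S-DWH implementation. Records that cannot be harmonized automatically are set aside for human review, as the slide suggests.

```python
# Hypothetical warehouse code list: source-specific codes -> warehouse code.
CODE_MAP = {"IT": "ITA", "Italy": "ITA", "EE": "EST", "Estonia": "EST"}
# Hypothetical conversion factors to the warehouse unit of measurement (euro).
TO_EURO = {"EUR": 1.0, "kEUR": 1000.0}


def harmonize(record):
    """Return (clean_record, issues); a non-empty issues list means manual review."""
    issues = []
    country = CODE_MAP.get(record.get("country"))
    if country is None:
        issues.append("unknown country code")
    factor = TO_EURO.get(record.get("unit"))
    value = record.get("turnover")
    if factor is None or value is None:
        issues.append("missing or unrecognised turnover/unit")
        turnover = None
    else:
        turnover = value * factor
    clean = {"unit_id": record["unit_id"], "country": country, "turnover_eur": turnover}
    return clean, issues


def link(survey, archive):
    """Link two sources on unit_id to build extended units of analysis."""
    by_id = {rec["unit_id"]: rec for rec in archive}
    return [{**s, **by_id[s["unit_id"]]} for s in survey if s["unit_id"] in by_id]


survey_raw = [
    {"unit_id": 1, "country": "Italy", "turnover": 2.5, "unit": "kEUR"},
    {"unit_id": 2, "country": "EE", "turnover": 900.0, "unit": "EUR"},
    {"unit_id": 3, "country": "??", "turnover": None, "unit": "EUR"},
]
archive = [{"unit_id": 1, "employees": 12}, {"unit_id": 2, "employees": 4}]

results = [harmonize(rec) for rec in survey_raw]
clean = [rec for rec, issues in results if not issues]
for_review = [rec for rec, issues in results if issues]
linked = link(clean, archive)
```

In this sketch two records pass harmonization and are linked to the archive, while the third lands in `for_review`.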

DWA - Integration layer

From the GSBPM we consider the following sub-processes:

- 5 Process: 5.1 integrate data; 5.2 classify & code; 5.3 review, validate & edit; 5.4 impute; 5.5 derive new variables and statistical units; 5.6 calculate weights; 5.7 calculate aggregate; 5.8 finalize data files

- 6 Analyze: 6.1 prepare draft output; the presence of this sub-process in this layer is strictly related to regular production processes, in which the estimated measures are produced regularly, as in the STS

Statistical Data Warehouse layered architecture

The Integration layer always prepares the elaborated information for the Interpretation layer: from raw data, just uploaded into the S-DWH and not yet included in a production process, to micro/macro statistical data at any elaboration step of any production process.

Conversely, in the Interpretation layer it must be possible to easily access and analyze the micro/macro elaborated data of the production processes in any state of elaboration. This is because methodologists should be able to correct possible operational elaboration mistakes before, during and after any statistical production line, or design new elaboration processes for new surveys.

In this way a new concept or strategy can generate feedback toward the Integration layer, which can then correct, or increase the quality of, the regular production lines.

The Integration and Interpretation Layers are reciprocally functional to each other.

Statistical Data Warehouse layered architecture: example of the reciprocal functionality between the Integration and Interpretation Layers

At Estat level, official statistical output is defined by the Commission Regulation, and it can be modified annually to reflect changes in society or the economy.

The process of defining a new statistical output is carried out by statistical experts in European task forces and working groups.

Statistical experts should use the available primary data, at national or European level, to investigate and support any new statistical proposal, which should involve either a re-use of existing data or the design of a new process.

[Figure: Example of the reciprocal functionality between the Integration and Interpretation Layers. The STATISTICAL DATA WAREHOUSE stack (SOURCES LAYER, INTEGRATION LAYER, INTERPRETATION AND ANALYSIS LAYER, ACCESS LAYER), annotated with European task forces and four numbered steps: 1° new proposal; 2° new regulation; 3° new design strategy; 4° new output.]

[Figure: Statistical Data Warehouse layered architecture. Case: produce the necessary information. The SOURCE, INTEGRATION, INTERPRETATION and ACCESS layers with the GSBPM phases 2 DESIGN, 3 BUILD, 4 COLLECT, 5 PROCESS, 6 ANALYSIS, 7 DISSEMINATE and 9 EVALUATE.]

[Figure: Statistical Data Warehouse layered architecture. Case: re-use data to create new data. The SOURCE, INTEGRATION, INTERPRETATION and ACCESS layers with the GSBPM phases 2 DESIGN, 4 COLLECT, 5 PROCESS (shown twice), 6 ANALYSIS, 7 DISSEMINATE and 9 EVALUATE.]

Statistical Data Warehouse layered architecture: Source Layer

The Source Layer is the level in which we locate all the activities related to storing and managing internal or external data sources.

Typically, internal data come from direct data capturing carried out by CAWI, CAPI or CATI, while external data come from administrative archives, for example from Customs Agencies, Revenue Agencies, Chambers of Commerce and National Social Security Institutes.

DWA - Source layer

Specialized functions:

This level is responsible for storing, physically or virtually, the data from internal (surveys) or external (archives) sources for statistical purposes.

Typical data sources, in the context of business statistics, are:

- specific direct data capturing (CAWI, CATI, CAPI);

- archive from the Customs Agency;

- archive from the Revenue Agency;

- archive from the Chambers of Commerce;

- archive from the National Social Security Institute.

DWA - Source layer

From the GSBPM we consider the following sub-processes of phase 4, Collect:

- 4.2 set up collection, which ensures that the processes and technology are ready to collect data;

- 4.3 run collection, where the collection is implemented, with different collection instruments being used to collect the data;

- 4.4 finalize collection, which includes loading the collected data into a suitable electronic environment for further processing in the next layers.
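The GSBPM sub-processes that the slides assign to each of the four layers can be collected into one lookup table. The sub-process codes below are transcribed from the slides; the dictionary layout and the helper name `layers_for` are illustrative assumptions.

```python
# GSBPM sub-process codes per S-DWH layer, as listed on the slides.
LAYER_SUBPROCESSES = {
    "source": ["4.2", "4.3", "4.4"],
    "integration": ["5.1", "5.2", "5.3", "5.4", "5.5", "5.6", "5.7", "5.8", "6.1"],
    "interpretation": [
        "1.5", "2.1", "2.2", "2.4", "2.5", "2.6", "4.1",
        "5.1", "5.5", "5.6", "5.7",
        "6.1", "6.2", "6.3", "6.4", "6.5", "7.1", "9.1", "9.2",
    ],
    "access": ["7.1", "7.2", "7.3", "7.4", "7.5"],
}


def layers_for(subprocess):
    """Return the layers in which a given sub-process is carried out."""
    return [layer for layer, subs in LAYER_SUBPROCESSES.items() if subprocess in subs]


# Sub-processes that appear in more than one layer.
all_codes = {code for subs in LAYER_SUBPROCESSES.values() for code in subs}
shared = sorted(code for code in all_codes if len(layers_for(code)) > 1)
```

The `shared` list makes the overlaps explicit, for example 6.1 (prepare draft output), which appears in both the integration and the interpretation layer.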

Glossary:

- Data mining: an interdisciplinary subfield of computer science; the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.

- Data Warehouse: a central repository of data created by integrating data from one or more disparate sources.

- Data Mart: a subset of the data warehouse, usually oriented to a specific business line or to specific IT tools.

- OLAP (OnLine Analytical Processing): an approach to answering multi-dimensional analytical queries swiftly.

- Data Vault: a database modeling method designed to provide long-term historical storage of data coming in from multiple operational systems.

- Primary Information: the original information, coded and cleaned in a common coherent environment.

- Secondary Information: information derived from primary information; it could involve aggregate data.

- Business intelligence (BI): a set of theories, methodologies, processes, architectures and technologies that transform raw data into meaningful and useful information.

- BI-Tools: a set of IT instruments used to analyze structured data and disseminate information with a topical focus.

- Graphics and publishing tools: tools able to generate graphs and tables for their users; this solution consists essentially of just a couple of steps, to avoid inefficiency.

- CAPI (Computer-Assisted Personal Interviewing): an interviewing technique in which the respondent or interviewer uses a computer to answer the questions.

- CATI (Computer-Assisted Telephone Interviewing): a telephone surveying technique in which the interviewer follows a script provided by a software application.

- CAWI (Computer-Assisted Web Interviewing): an Internet surveying technique in which the respondent follows a script provided on a website.

- Interoperability: the ability of two or more systems or components to exchange information and to use the information that has been exchanged.

- Operational system: a system used to process the day-to-day transactions of an organization.

Modeling the Business Architecture

In statistics, a possible standard definition of the production process is the Generic Statistical Business Process Model (GSBPM), with its 9 phases:

1 Specify Needs, 2 Design, 3 Build, 4 Collect, 5 Process, 6 Analyze, 7 Disseminate, 8 Archive, 9 Evaluate.

Each phase is articulated in several statistical sub-processes; according to process modeling theory, each sub-process should have a number of clearly identified attributes (input, output, owner, purpose, guide, enablers, feedback, ...).
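One way to make the sub-process attributes concrete is a small record type. This is a sketch only: the `SubProcess` class, its field types and the example values (in particular the owner and purpose strings) are assumptions for illustration; only the phase names and the attribute list come from the slide.

```python
from dataclasses import dataclass, field

# The 9 GSBPM phases as listed on the slide.
GSBPM_PHASES = {
    1: "Specify Needs", 2: "Design", 3: "Build", 4: "Collect", 5: "Process",
    6: "Analyze", 7: "Disseminate", 8: "Archive", 9: "Evaluate",
}


@dataclass
class SubProcess:
    code: str                       # e.g. "5.1"
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    owner: str = ""
    purpose: str = ""
    guides: list = field(default_factory=list)
    enablers: list = field(default_factory=list)
    feedback: list = field(default_factory=list)

    @property
    def phase(self) -> str:
        # The phase number is the part of the code before the dot.
        return GSBPM_PHASES[int(self.code.split(".")[0])]


# Hypothetical example values, for illustration only.
integrate = SubProcess(
    code="5.1",
    name="integrate data",
    inputs=["survey data", "administrative archives"],
    outputs=["integrated data set"],
    owner="integration team",
    purpose="combine sources into a single coherent data set",
)
```

With such a record, `integrate.phase` resolves to the phase name from the sub-process code alone.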


Generic Statistical Business Process Model (GSBPM)
