ESSnet on BD II WPC: Results and alignment to BREAL

ESSnet on BD II WPC: Results and alignment to BREAL. Implementation Track Meeting, 9-10 December 2019, Wien, Austria


Page 1:

ESSnet on BD II WPC:

Results and alignment to BREAL

Implementation Track Meeting

9-10 December 2019

Wien, Austria

Page 2:

WPC: Overview

Actors: AT, BG (Coordinator), DE, FI, IR, IT, NL, PL and UK

Main objectives:

• Improve or update existing information (SBR)

• Maximize the quality and quantity of the statistical outputs (ICT survey)

• Achieve important economies of scale: opportunities for sharing resources at ESS level

5 tasks:

• ESS web-scraping policy

• Reference Methodological Framework (RMF)

• Experimental Statistics

• Starter Kit for NSIs

• Quality template for statistical outputs

Page 3:

Deliverables (until end of October 2019)

ESS web-scraping policy Template

Reference Methodological Framework (RMF), ver. 1.0

Functional production Prototypes

Experimental Statistics

Page 4:

ESS web-scraping policy Template

Purpose and scope

8 main sub-sections: Preamble, Background, Scope, Principles, Practices (Implementation guidelines), Roles and responsibilities, Governance and Glossary

Valuable contribution from Eurostat

Current status

Need to discuss further

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/0/0a/WPC_ESS_webscraping_policy_template.pdf

Page 5:

Reference Methodological Framework, v. 1.0

Purposes

• to describe a complete OBEC statistics processing pipeline across the main big data life cycle phases

• to be a reference guide and template for NSIs within ESS

• to be a relevant document for NSIs during the implementation process

Six chapters

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/1/1f/WPC_Deliverable_C2_Reference_Methodological_Framework_v1.0.pdf

Page 6:

Reference Methodological Framework, v. 1.0

Four use cases are defined:

1. URLs Inventory

2. Variables in the ICT survey

3. Data driven discovery of emergent enterprise classifications

4. Experimental language statistics

Implementable Proof-of-concepts

Page 7:

Statistical products

Main concept

• online based enterprise characteristics (OBECs): any attributes/characteristics, linked to businesses, that have been extracted from webpages (e.g. an enterprise’s URL).

At input level

• statistical unit: enterprises and/or webpages

• target population: enterprises included in the target population of ICT survey or just a sample thereof

• observation variables: variables are observed by using search engines, APIs and/or web scraping software

Page 8:

Statistical products

At output level

• periodicity: at least once a year, or in accordance with the observation period of the ICT survey

• statistical indicators:

Rate of enterprises having websites

Rate of enterprises engaged in web sales on their website

Rate of enterprises that are present on social media

Rate of enterprises using Twitter for a specific purpose

Rate of enterprises having specific features of the website

Rate of enterprises working on upcoming/new phenomena, specifically AI and ML
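A minimal sketch of how rate indicators of this kind could be computed once each enterprise has been flagged from its scraped website. The record fields, the sample data and the unweighted formula are illustrative assumptions, not the WPC implementation; a production estimate would normally apply survey weights.

```python
from dataclasses import dataclass


@dataclass
class EnterpriseRecord:
    # Hypothetical per-enterprise flags derived from scraped web content.
    enterprise_id: str
    has_website: bool
    has_web_sales: bool
    on_social_media: bool


def rate(records, flag):
    """Unweighted share (in percent) of enterprises where `flag` is true."""
    if not records:
        raise ValueError("cannot compute a rate on an empty population")
    hits = sum(1 for r in records if getattr(r, flag))
    return 100.0 * hits / len(records)


# Made-up sample of four enterprises, for illustration only.
sample = [
    EnterpriseRecord("E1", True, True, True),
    EnterpriseRecord("E2", True, False, True),
    EnterpriseRecord("E3", False, False, False),
    EnterpriseRecord("E4", True, False, False),
]

website_rate = rate(sample, "has_website")
web_sales_rate = rate(sample, "has_web_sales")
social_media_rate = rate(sample, "on_social_media")
```

The same function serves every boolean indicator in the list, so adding a new indicator only requires a new flag on the record.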

Page 9:

Big data processing life cycle on OBECs

High level view on enterprise characteristics web scraping process

Page 10:

GSBPM phases recognized in the big data lifecycle

GSBPM 4 Collect: Acquisition/Recording OBECs

• a list of companies for which data will be collected is identified (target population)

• a list of potential website addresses is built

• a partial crawl is run on the potential websites to collect data

• the “first-best” website is chosen for each enterprise
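The last step, choosing one website per enterprise from several candidates, could look roughly like the sketch below. The scoring rule (name tokens found in the domain or in the crawled text) and all names are assumptions made for illustration; the slides do not specify how the "first-best" website is selected.

```python
from urllib.parse import urlparse


def score_candidate(enterprise_name, url, page_text):
    """Toy score: each name token counts 2 if in the domain, 1 if in the text."""
    tokens = [t for t in enterprise_name.lower().split() if len(t) > 2]
    domain = urlparse(url).netloc.lower()
    text = page_text.lower()
    return sum(2 * (t in domain) + (t in text) for t in tokens)


def first_best(enterprise_name, candidates):
    """Return the highest-scoring URL, or None if no candidate scores above 0.

    `candidates` is a list of (url, crawled_text) pairs.
    """
    best_url, best_score = None, 0
    for url, text in candidates:
        s = score_candidate(enterprise_name, url, text)
        if s > best_score:
            best_url, best_score = url, s
    return best_url


# Hypothetical candidates returned by a search engine for one enterprise.
candidates = [
    ("https://directory.example/acme", "Listing: Acme"),
    ("https://www.acme-tools.example", "Acme Tools Ltd, contact us"),
]
chosen = first_best("Acme Tools", candidates)
```

Returning None when nothing matches keeps enterprises without a plausible website out of the later scraping steps instead of assigning them an arbitrary URL.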

Page 11:

GSBPM phases recognized in the big data lifecycle

GSBPM 5 Process: Extraction, cleaning and annotation, integration, aggregation and representation, etc.

• pre-processing the raw dataset (including tag identification)

• processing data into machine readable format (including data cleansing and text mining methods)

• data evaluation and improvement (including imputation of missing data/data linkability)
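As an illustration of the pre-processing and cleansing steps, the sketch below strips tags from raw HTML and turns the visible text into lowercase tokens using only the Python standard library. It is a simplified assumption of what such a step might involve, not the WPC software; real pipelines typically add language detection, stop-word removal and similar text mining steps.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_tokens(html):
    """Turn raw HTML into a list of lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return re.findall(r"\w+", text)


raw = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Online Shop</p><script>var x=1;</script></body></html>"
)
tokens = html_to_tokens(raw)
```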

Page 12:

GSBPM phases recognized in the big data lifecycle

GSBPM 6 Analyse: Modelling and interpretation

• validate/reject candidate URLs

• calculation of enterprise characteristics through modelling and interpretation

• microdata is aggregated before publishing (e.g. by NACE, NUTS, number of employees)

GSBPM 7 Disseminate

• not significantly different from the traditional dissemination processes
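The aggregation step in GSBPM 6 can be sketched as a grouped rate over the microdata. The column names (`nace`, `has_website`) and the sample rows are hypothetical, and the rates are unweighted for simplicity.

```python
from collections import defaultdict


def rate_by_group(microdata, group_key, flag_key):
    """Unweighted percentage of rows with a true flag, per group value."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in microdata:
        group = row[group_key]
        totals[group] += 1
        hits[group] += bool(row[flag_key])
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Made-up microdata: one row per enterprise, grouped by NACE section.
microdata = [
    {"enterprise_id": "E1", "nace": "C", "has_website": True},
    {"enterprise_id": "E2", "nace": "C", "has_website": False},
    {"enterprise_id": "E3", "nace": "G", "has_website": True},
]
rates = rate_by_group(microdata, "nace", "has_website")
```

Passing a different `group_key`, for example a NUTS region or an employee size class, gives the other breakdowns mentioned above.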

Page 13:

Functional production prototypes

URLs Inventory of enterprises

• the methods used are described

• the software used is available on the wiki WPC GitHub

• Python, Java, PHP, Node and R are the main web scraping languages

• 2 procedures were adopted: (1) Java, Solr and R; (2) PHP and MySQL

• the process was tested successfully several times in some WPC countries and is suitable for integration into real statistical production

• the result can be used to retrieve information from enterprise websites on variables in the ICT survey and on new variables, and to validate the SBR or NACE classification

Page 14:

Functional production prototypes

Variables in the ICT usage in enterprise survey

• five different indicators have been prepared

• two sets of methods are used to provide output data for the indicators

• the software used is available on the wiki WPC GitHub

• the procedure for using the prototype is described

• the process was tested in some WPC countries and no significant proportion of errors occurred

• statistical conclusions can be drawn from the website data
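One way an ICT-survey-style variable such as social media presence might be derived from a scraped page is to look for outgoing links to known platforms. The sketch below is an assumption for illustration only: the platform list, the function names and the link-based rule are not taken from the WPC prototypes.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical mapping of host names to platform labels.
SOCIAL_DOMAINS = {
    "twitter.com": "Twitter",
    "facebook.com": "Facebook",
    "linkedin.com": "LinkedIn",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def social_media_presence(html):
    """Return the set of platform labels linked from the page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = set()
    for href in collector.hrefs:
        host = urlparse(href).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_DOMAINS:
            found.add(SOCIAL_DOMAINS[host])
    return found


page = (
    '<a href="https://www.twitter.com/acme">Follow us</a>'
    '<a href="/contact">Contact</a>'
    '<a href="https://facebook.com/acme">Facebook</a>'
)
platforms = social_media_presence(page)
```

A non-empty result would set the "present on social media" flag for the enterprise; relative links such as `/contact` have no host and are ignored.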

Page 15:

Proof-of-concept prototype

Data driven discovery of emergent enterprise classifications

• NLPT is used to discover new data driven classifications of enterprises

• the expected outcome is one or more new enterprise classifications and the corresponding distributions of the scraped enterprises

• a 2-phase process (data acquisition and data analysis) is described

• the software used is available on the wiki WPC GitHub

• obtained results: did not match the expectations

• additional experiments are needed to improve the quality of the obtained results

Page 16:

Proof-of-concept prototype

Experimental language statistics

• clustering enterprises/website owners by descriptions of their business or sustainability activities

• expected outcomes are: Business Activity Cluster and Sustainable Activities Cluster

• an outline of the processing pipeline is presented

• the procedure and software used are described in detail

• the methodology developed should be considered a work in progress

• pipeline improvements are needed

• bias/coverage in the dataset should be explored

Page 17:

Experimental statistics: OBECs

Dissemination of statistical outputs

• Results: calculation of the statistical indicators defined in the RMF 1.0

• Integration of OBECs information at macro and/or micro level

• Methodology: explanation of how the results were produced

• Experimental section on the wiki

Page 18:

WPC Application architecture

Page 19:

WPC Information architecture

Page 20:

Business process: URLs Inventory

Page 21:

Business Process: OBECs

2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729

Page 22:

What comes next?

Start to develop the Starter Kit for NSIs

Describing the reference architecture for OBEC data (RMF, Ver. 2.0)

Defining the implementation requirements at national level and at ESS level (RMF, Ver. 2.0)

Quality Report for OBEC outputs based on SIMS 2.0 (WPK deliverable)

Experimental statistics 2020

Page 23:

THANK YOU FOR YOUR ATTENTION!

Galya Stateva

Bulgarian National Statistical Institute
