ESSnet on BD II WPC:
Results and alignment to BREAL
Implementation Track Meeting
9-10 December 2019
Vienna, Austria
WPC: Overview
Actors: AT, BG (Coordinator), DE, FI, IR, IT, NL, PL and UK
Main objectives:
• Improve or update existing information (SBR)
• Maximize the quality and quantity of the statistical outputs (ICT survey)
• Achieve important economies of scale: opportunities for sharing resources at ESS level
5 tasks:
• ESS web-scraping policy
• Reference Methodological Framework (RMF)
• Experimental Statistics
• Starter Kit for NSIs
• Quality template for statistical outputs
Deliverables (until the end of October 2019)
ESS web-scraping policy Template
Reference Methodological Framework (RMF), ver. 1.0
Functional production Prototypes
Experimental Statistics
ESS web-scraping policy Template
Purpose and scope
8 main sub-sections: Preamble, Background, Scope, Principles, Practices (Implementation guidelines), Roles and responsibilities, Governance and Glossary
Valuable contribution from Eurostat
Current status: needs further discussion
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/0/0a/WPC_ESS_webscraping_policy_template.pdf
Reference Methodological Framework, v. 1.0
Purposes
• to describe a complete OBEC statistics processing pipeline across the main big data life cycle phases
• to be a reference guide and template for NSIs within the ESS
• to be a relevant document for NSIs during the implementation process
Six chapters
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/1/1f/WPC_Deliverable_C2_Reference_Methodological_Framework_v1.0.pdf
Reference Methodological Framework, v. 1.0
Four use cases are defined:
1. URLs Inventory
2. Variables in the ICT survey
3. Data driven discovery of emergent enterprise classifications
4. Experimental language statistics
Use cases 1 and 2 are implementable; use cases 3 and 4 are proofs of concept.
Statistical products
Main concept
• online based enterprise characteristics (OBECs): any attributes/characteristics, linked to businesses, that have been extracted from webpages (e.g. the enterprise’s URL)
At input level
• statistical unit: enterprises and/or webpages
• target population: enterprises included in the target population of ICT survey or just a sample thereof
• observation variables: variables are observed by using search engines, APIs and/or web scraping software
Statistical products
At output level
• periodicity: at least once a year, or in accordance with the observation period of the ICT survey
• statistical indicators:
Rate of enterprises having websites
Rate of enterprises engaged in web sales on their website
Rate of enterprises that are present on social media
Rate of enterprises using Twitter for a specific purpose
Rate of enterprises having specific features of the website
Rate of enterprises working on upcoming/new phenomena, specifically AI and ML
Big data processing life cycle on OBECs
High level view on enterprise characteristics web scraping process
GSBPM phases recognized in the big data lifecycle
GSBPM 4 Collect: Acquisition/Recording OBECs
• a list of companies for which data will be collected (the target population) is identified
• a list of potential website addresses is built
• a partial crawl of the potential websites is carried out
• the “first-best” website is chosen for each enterprise
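The selection step above could be sketched as follows. This is a minimal illustration, not the WPC software: the Candidate class, the identifier-matching score and the example values are all assumptions introduced here, and the real prototypes may rank candidate websites quite differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One potential website for an enterprise (hypothetical structure)."""
    url: str
    page_text: str  # text obtained from the partial crawl


def score_candidate(candidate: Candidate, identifiers: List[str]) -> int:
    """Count how many known enterprise identifiers occur in the crawled text."""
    text = candidate.page_text.lower()
    return sum(1 for ident in identifiers if ident.lower() in text)


def first_best(candidates: List[Candidate], identifiers: List[str]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if no identifier matches."""
    scored = [(score_candidate(c, identifiers), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0, None))
    return best if best_score > 0 else None


# Illustrative data only: names, VAT number and URLs are made up.
candidates = [
    Candidate("http://a.example", "Welcome to our shop"),
    Candidate("http://b.example", "ACME Ltd, VAT BG123456789, Sofia"),
]
identifiers = ["ACME Ltd", "BG123456789"]
chosen = first_best(candidates, identifiers)
```

Returning None when nothing matches keeps enterprises without a plausible website out of the later phases instead of assigning them an arbitrary URL.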
GSBPM phases recognized in the big data lifecycle
GSBPM 5 Process: Extraction, cleaning and annotation, integration, aggregation and representation, etc.
• pre-processing the raw dataset (including tag identification)
• processing data into machine readable format (including data cleansing and text mining methods)
• data evaluation and improvement (including imputation of missing data/data linkability)
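As a rough illustration of the pre-processing step (tag identification and conversion of raw HTML into a machine readable form), the sketch below uses only Python's standard html.parser module. The PageExtractor class, the preprocess function and the output layout are assumptions made for this example, not part of the WPC prototypes.

```python
from collections import Counter
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text, tag counts and link targets from raw HTML."""

    SKIPPED = {"script", "style"}  # elements whose content is not visible text

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def preprocess(raw_html: str) -> dict:
    """Turn one raw page into a small machine readable record."""
    parser = PageExtractor()
    parser.feed(raw_html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "tags": dict(parser.tags),
        "links": parser.links,
    }


raw = (
    "<html><head><style>p{color:red}</style></head>"
    '<body><p>Buy <a href="/shop">online</a></p>'
    "<script>var x=1;</script></body></html>"
)
record = preprocess(raw)
```

The resulting record separates the text (input for text mining) from the tag and link inventory (input for feature detection), which mirrors the split between cleansing and tag identification named in the bullets.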
GSBPM phases recognized in the big data lifecycle
GSBPM 6 Analyse: Modelling and interpretation
• validate/reject candidate URLs
• calculation of enterprise characteristics through modelling and interpretation
• microdata are aggregated before publishing (e.g. by NACE, NUTS, number of employees)
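The aggregation step can be illustrated with a small sketch that turns enterprise-level flags into rates per breakdown group. The record layout, the field names "nace" and "has_website" and the unweighted estimator are assumptions for illustration only; production estimates may well require sampling weights.

```python
from collections import defaultdict


def rates_by_group(microdata, group_key, indicator_key):
    """Unweighted share of enterprises with the indicator set, per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in microdata:
        group = record[group_key]
        totals[group] += 1
        if record[indicator_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Made-up microdata: one record per enterprise.
microdata = [
    {"nace": "C", "has_website": True},
    {"nace": "C", "has_website": False},
    {"nace": "G", "has_website": True},
]
rates = rates_by_group(microdata, "nace", "has_website")
```

The same function can be called with a NUTS or size-class field as group_key to obtain the other breakdowns mentioned above.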
GSBPM 7 Disseminate
• not significantly different from the traditional dissemination processes
Functional production prototypes
URLs Inventory of enterprises
• the methods used are described
• the software used is available on the wiki and in the WPC GitHub repository
• Python, Java, PHP, Node and R are the main web scraping languages
• two procedures were adopted: one based on Java, Solr and R, the other on PHP and MySQL
• the process was tested successfully several times in some WPC countries and is suitable for integration into real statistical production
• the result can be used to retrieve information from enterprise websites on variables in the ICT survey, on new variables, and for validation of the SBR or the NACE classification
Functional production prototypes
Variables in the ICT usage in enterprises survey
• five different indicators have been prepared
• two sets of methods are used to provide output data for the indicators
• the software used is available on the wiki and in the WPC GitHub repository
• the procedure for using the prototype is described
• the process was tested in some WPC countries and no significant proportion of errors occurred
• statistical conclusions can be drawn from the website data
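One of the indicators, presence on social media, could be approximated from the links found on an enterprise website. The sketch below is an assumption-laden illustration rather than either of the two WPC method sets: the SOCIAL_DOMAINS list and the function name are hypothetical, and a link to a platform does not prove that the enterprise owns the linked account.

```python
from urllib.parse import urlparse

# Illustrative, incomplete list of social media domains (an assumption).
SOCIAL_DOMAINS = {
    "facebook.com",
    "twitter.com",
    "linkedin.com",
    "instagram.com",
    "youtube.com",
}


def social_media_links(links):
    """Return the social media domains that the given links point to."""
    found = set()
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        for domain in SOCIAL_DOMAINS:
            # Match the domain itself or any of its subdomains.
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found


links = [
    "/contact",
    "https://www.facebook.com/acme",
    "https://twitter.com/acme",
    "https://notfacebook.com/x",
]
platforms = social_media_links(links)
present_on_social_media = bool(platforms)
```

Matching on the parsed hostname rather than on a raw substring avoids counting look-alike domains such as the last link in the example.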
Proof-of-concept prototype
Data driven discovery of emergent enterprise classifications
• NLPT is used to discover new data driven classifications of enterprises
• the expected outcome is one or more new enterprise classifications and the corresponding distributions of the scraped enterprises
• the two-phase process (data acquisition and data analysis) is described
• the software used is available on the wiki and in the WPC GitHub repository
• the results obtained did not match expectations
• additional experiments are needed to improve the quality of the results
Proof-of-concept prototype
Experimental language statistics
• clustering enterprises/website owners by descriptions of their business or sustainability activities
• expected outcomes are: Business Activity Cluster and Sustainable Activities Cluster
• an outline of the processing pipeline is presented
• the procedure and software used are described in detail
• methodology developed should be considered a work in progress
• pipeline improvements needed
• explore bias/coverage in the dataset
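To give a feel for clustering enterprises by their activity descriptions, here is a deliberately simple sketch: TF-IDF weighting, cosine similarity and a greedy threshold assignment, written with the standard library only. The slides do not state which algorithm the pipeline uses, so this technique, the 0.3 threshold and the sample descriptions are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (smoothed IDF)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(documents)
    return [
        {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(documents, threshold=0.3):
    """Assign each document to the first cluster whose seed is similar enough."""
    vectors = tfidf_vectors(documents)
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


descriptions = [
    "organic farm selling vegetables and fruit",
    "vegetables and fruit from our organic farm",
    "software development and cloud consulting",
    "cloud software consulting services",
]
clusters = greedy_clusters(descriptions)
```

A greedy assignment depends on document order and on the threshold, which is one reason a real pipeline would more likely rely on an established clustering method and on an explicit check of bias and coverage in the input texts.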
Experimental statistics: OBECs
Dissemination of statistical outputs
• Results: calculation of the statistical indicators defined in the RMF 1.0
• Integration of OBECs information at macro and/or micro level
• Methodology: explanation of how the results were produced
• Experimental section on the wiki
WPC Application architecture
WPC Information architecture
Business process: URLs Inventory
Business Process: OBECs
2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729
What comes next?
Start to develop the Starter Kit for NSIs
Describing the reference architecture for OBEC data (RMF, Ver. 2.0)
Defining the implementation requirements at national level and at ESS level (RMF, Ver. 2.0)
Quality Report for OBEC outputs based on SIMS 2.0 (WPK deliverable)
Experimental statistics 2020
THANK YOU FOR YOUR ATTENTION!
Galya Stateva
Bulgarian National Statistical Institute