Use of web scraping and text mining techniques in the Istat survey on “Information and Communication Technology in enterprises”

Giulio Barcaroli(*), Alessandra Nurra(*), Marco Scarnò(**), Donato Summa(*)

(*) Italian National Institute of Statistics (Istat)
(**) Cineca

Quality 2014, Wien, June 2-5 2014


The “ICT in enterprises” survey

In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees, by means of a sample survey involving 19,186 of them (2011).

In the 2013 round of the survey, 8,687 enterprises indicated their website (45% of the responding sample units).

Accessing the indicated websites to gather information directly from them opens up several opportunities.

The “ICT in enterprises” survey

Actions and targets:

1. Action: substitute the traditional questionnaire-based collection technique with a new one based on the Internet as Data Source (IaD), for all suitable questions. Target: reduction of respondent burden.

2. Action: integrate the information collected via questionnaire with the information collected via IaD. Target: increase of accuracy of estimates.

3. Action: collect additional information. Target: increase the offer of statistical information.

Predictive approach vs Content Analysis

We assume that our target is to increase the accuracy of estimates by using data originating from the Internet as auxiliary data.

This particular case is based on the use of textual data as auxiliary data.

Texts are a “perfect” example of unstructured data, which is one of the characteristics of most Big Data.

First, the usual model-based approach will be followed, requiring the prediction of values at unit level: under this approach, the target is to maximise the correctness of classification for each unit in the reference population.

Next, a different approach will be illustrated, where the prediction of values at unit level is no longer required and the target becomes to maximise directly the accuracy at the aggregate level (accuracy of estimates).

Predictive approach

In a predictive approach, the subset of data related to sampled respondent units can be considered as the labeled data, and supervised learning methods can be applied.

In other words, the subset of 8,687 enterprises that reported having a website or a home page, and also answered questions [B8a : B8g], can be considered as the training and test set by means of which different models can be estimated in order to predict the answers to questions [B8a : B8g] for the whole reference population.

[Diagram with the following elements: Texts (website content); Survey Microdata; Text and data mining; Model]

Predictive approach

In our case, we can apply one of the following supervised learning methods:

• Classification Trees;
• “ensembles” (Bootstrap Aggregating, Adaptive Boosting, Random Forests);
• Supervised Latent Dirichlet Allocation for classification (SLDA);
• Neural Networks;
• Logistic Regression;
• Support Vector Machines;
• Naïve Bayes.
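As an illustration of how one of the listed learners works on text, the following is a minimal sketch of a multinomial Naïve Bayes classifier on word counts, written with the Python standard library only. It is not the implementation used in the survey: the tokenizer, the add-one smoothing, the function names and the toy texts are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it on whitespace (deliberately naive)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)          # documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy texts standing in for scraped website content.
train_texts = [
    "add to cart checkout order online payment",
    "book your room online reservation payment",
    "company history mission contact us",
    "our team contact address history",
]
train_labels = ["yes", "yes", "no", "no"]  # hypothetical answers to B8a

model = train_naive_bayes(train_texts, train_labels)
prediction = predict(model, "online order and payment")
```

In a real application the texts would be the scraped website contents and the labels the questionnaire answers of the responding enterprises; preprocessing such as stop-word removal or stemming is omitted here.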

Evaluation of predictive models

From the error matrix it is possible to compute the following indicators:

Indicator                            Expression         Meaning
Accuracy (precision)                 (TP+TN) / Total    Rate of correctly classified cases
Sensitivity (true positives rate)    TP / (TP+FN)       Rate of positive cases correctly classified
Specificity (true negatives rate)    TN / (FP+TN)       Rate of negative cases correctly classified
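The three indicators can be computed directly from the cells of the error matrix. The sketch below assumes binary labels coded as booleans (True for a positive case); the function names and the example data are invented for illustration.

```python
def error_matrix(observed, predicted):
    """Count TP, FP, FN, TN for binary labels (True = positive case)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    tn = sum(1 for o, p in pairs if not o and not p)
    return tp, fp, fn, tn


def indicators(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity as defined in the table above."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,       # correctly classified cases
        "sensitivity": tp / (tp + fn),       # positives correctly classified
        "specificity": tn / (fp + tn),       # negatives correctly classified
    }


# Invented example: five units, observed answers vs. model predictions.
observed = [True, True, False, False, False]
predicted = [True, False, False, False, True]

cells = error_matrix(observed, predicted)
result = indicators(*cells)
```

Note that the slides use “precision” as a synonym for accuracy; the sketch follows that definition, (TP+TN) / Total.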

Evaluation of predictive models

Application of different learners to predict question B8a “Online ordering or reservation or booking (Yes/No)”

Evaluation of predictive models

In general, when the misclassified cases (false positives and false negatives) are not balanced in absolute terms, the distribution of predicted values can differ significantly from the distribution of observed values.
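A small numerical sketch makes this point concrete. The observed proportion of positives is (TP+FN) / Total and the predicted proportion is (TP+FP) / Total, so the two coincide only when FP equals FN. The counts below are hypothetical and chosen so that both classifiers have the same accuracy (80%).

```python
def proportions(tp, fp, fn, tn):
    """Return (observed, predicted) proportions of positive cases."""
    total = tp + fp + fn + tn
    observed = (tp + fn) / total     # units that are truly positive
    predicted = (tp + fp) / total    # units classified as positive
    return observed, predicted


# Hypothetical error matrices, both with accuracy (TP+TN)/Total = 0.80.
balanced = proportions(tp=10, fp=10, fn=10, tn=70)     # FP == FN
unbalanced = proportions(tp=18, fp=18, fn=2, tn=62)    # FP > FN

print(balanced)     # observed and predicted proportions coincide
print(unbalanced)   # predicted proportion overshoots the observed one
```

The difference between the predicted and the observed proportion is (FP - FN) / Total, which is why a classifier with lower accuracy can still give a better aligned aggregate.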

From these results, the Naïve Bayes predictor can be considered the most convenient: although its precision (78%) is the lowest, its sensitivity is the highest, its specificity is good, and the observed and predicted proportions are perfectly aligned.

Evaluation of predictive models

Application of Naïve Bayes to predict all questions in section B8

Question B8: “Indicate if the website has any of the following facilities”. Performance of Naïve Bayes:

Facility                                                                        Precision  Sensitivity  Specificity  Observed proportion  Predicted proportion
a) Online ordering or reservation or booking (web sales functionality)          0.78       0.50         0.86         0.21                 0.21
b) Tracking or status of orders placed                                          0.82       0.49         0.85         0.18                 0.11
c) Description of goods or services, price lists                                0.62       0.44         0.79         0.48                 0.32
d) Personalized content in the website for regular/repeated visitors            0.74       0.41         0.78         0.09                 0.23
e) Possibility for visitors to customize or design online goods or services     0.86       0.53         0.87         0.05                 0.14
f) A privacy policy statement, a privacy seal or a website safety certificate   0.59       0.57         0.64         0.68                 0.51
g) Advertisement of open job positions or online job application                0.69       0.52         0.78         0.35                 0.33

Content analysis performance …

To verify the robustness of Content Analysis, we repeated the selection of a training set from the survey data 40 times (each time producing an estimate of the proportion of websites with web sales functionality), for different ratios of the training set to the total (from 10% to 90%).

The results show that the method is correct until a training rate of 30%, but the estimates vary greatly at every rate.
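The repeated-selection exercise can be sketched as a small harness. This is only an outline under stated assumptions: `repeated_estimates` and `training_share` are hypothetical names, the labels are synthetic, and the placeholder estimator simply returns the share of positives in the training set, whereas a real run would fit Content Analysis (or a classifier) on the training set and estimate the proportion for the remaining units.

```python
import random
import statistics


def repeated_estimates(units, estimator, rate, repetitions=40, seed=0):
    """Draw `repetitions` random training sets containing a share `rate`
    of the units and return the estimate obtained from each draw."""
    rng = random.Random(seed)
    n_train = round(rate * len(units))
    estimates = []
    for _ in range(repetitions):
        shuffled = units[:]
        rng.shuffle(shuffled)
        train, rest = shuffled[:n_train], shuffled[n_train:]
        estimates.append(estimator(train, rest))
    return estimates


def training_share(train, rest):
    """Placeholder estimator: share of positive labels in the training set."""
    return sum(train) / len(train)


# Synthetic labels: 21 positives out of 100 units (1 = web sales present).
units = [1] * 21 + [0] * 79

summary = {}
for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    estimates = repeated_estimates(units, training_share, rate)
    # mean and spread of the 40 estimates obtained at this training rate
    summary[rate] = (statistics.mean(estimates), statistics.stdev(estimates))
```

Comparing the mean of the estimates with the known proportion indicates bias, while the standard deviation across the 40 draws indicates variability at each training rate.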

… compared to Naïve Bayes

The same exercise has been carried out for Naive Bayes.

The results show a minimal bias (on the order of one or two percentage points), but a much lower variability.
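The trade-off between bias and variability can be summarised with the mean squared error, which equals the variance of the estimates plus the squared bias. The sketch below uses invented sets of repeated estimates and an assumed true proportion of 0.21; it does not reproduce the figures of the study.

```python
import statistics


def bias_variance_mse(estimates, true_value):
    """Return (bias, variance, MSE) of repeated estimates of `true_value`."""
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    # population variance, so that MSE = variance + bias ** 2 holds exactly
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


true_proportion = 0.21                    # assumed value for illustration
stable = [0.22, 0.23, 0.22, 0.23]         # small bias, low variability
volatile = [0.11, 0.31, 0.16, 0.26]       # no bias, high variability

stable_result = bias_variance_mse(stable, true_proportion)
volatile_result = bias_variance_mse(volatile, true_proportion)
```

With these invented numbers the slightly biased but stable estimator has the lower mean squared error, which is the kind of comparison the two exercises suggest.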

Future work

The tested approach will be improved and extended in different directions:

1. With reference to the population of interest: we will consider the URLs of all the units belonging to the Business Register and perform a mass scraping of the related websites (in this case also dealing more directly with the high-volume problems related to Big Data), using the whole sampling subset of websites as a training set, so as to obtain a model that can be applied to the whole population. The aim is to produce estimates under a full predictive approach, reducing the sampling errors at the cost of introducing additional bias (both components of the MSE should be evaluated).

2. With reference to the content of the questionnaire: the results obtained with the set of variables contained in the “B8” section of the questionnaire will also be evaluated with the other suitable variables in the questionnaire (e-recruitment, e-procurement, use of social networks, etc.).

Thank you for your attention