Session 6A - Big data sources: web scraping and smart meters
Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies
Giulio Barcaroli (*) ([email protected]), Monica Scannapieco (*) ([email protected]), Marco Scarnò (**) ([email protected]), Donato Summa (*) ([email protected])
(*) Istituto Nazionale di Statistica (Istat)
(**) Consorzio Interuniversitario per il Calcolo Automatico (CINECA)
NTTS 2015
Web scraping definition and types

Web scraping is the process of automatically collecting information from the World Wide Web by means of tools (called scrapers, internet robots, crawlers, spiders, etc.) that navigate websites, extract their content and store the scraped data in local databases for subsequent processing.
We can distinguish two different kinds of web scraping:
1. specific web scraping, where both the structure and the content of the websites to be scraped are known in advance, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical areas of application: data collection for consumer price indices (ONS, CBS, Istat). A minimal sketch of this case follows below;
2. generic web scraping, where no a priori knowledge of the content is available: the whole website is scraped and subsequently processed in order to infer the information of interest.
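To make the distinction concrete, here is a minimal sketch of "specific" scraping in Java using jsoup (the parsing library discussed later in this talk). The target URL, the user-agent string and the CSS selectors are hypothetical; a real price-index scraper would know them in advance from the structure of the monitored shop.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// "Specific" scraping: the page structure is known in advance, so the
// elements of interest can be addressed directly with CSS selectors.
public class SpecificScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://shop.example.com/product/42") // hypothetical URL
                            .userAgent("PriceIndexBot/0.1")                 // hypothetical identifier
                            .timeout(10_000)
                            .get();
        // Hypothetical selectors: derived from the known layout of the target page.
        String name  = doc.select("h1.product-name").text();
        String price = doc.select("span.price").text();
        System.out.println(name + " -> " + price);
    }
}
```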
An application to the «ICT in enterprises» survey
Web scraping different techniques and tools

Different solutions for web scraping are being investigated, based on the use of:
(i) the Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching of results. Nutch is a highly extensible and scalable open source web crawler; it provides parsing, indexing, the building of a search engine that can be customized according to needs, scalability, robustness, and scoring filters for custom implementations;
(ii) HTTrack (http://www.httrack.com/), a free and open source software tool that allows a website to be “mirrored” locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
(iii) JSOUP (http://jsoup.org), a library that parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net), the latter selected because it already includes facilities for handling huge data sets and textual information. A sketch of this parsing step follows below.
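As an illustration of the "generic" case, the following sketch shows what the JSOUP-based step essentially does: download a page, keep the raw HTML for storage, extract the visible text for later mining, and collect out-links for further crawling. The target URL and file name are placeholders; the actual integration lives inside ADaMSoft.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.nio.file.*;

// "Generic" scraping: no assumption on page structure; keep everything
// and defer the extraction of meaning to the text-mining phase.
public class GenericScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.example.com").get(); // hypothetical target

        // Store the raw page for later processing (ADaMSoft keeps these in
        // compressed binary files; here we simply write the HTML to disk).
        Files.writeString(Path.of("page.html"), doc.outerHtml());

        // Extract the full visible text: this is the input to text mining.
        String text = doc.body().text();
        System.out.println(text.substring(0, Math.min(200, text.length())));

        // Collect absolute out-links, which a crawler would enqueue next.
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("abs:href"));
        }
    }
}
```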
Web scraping solutions evaluation

These techniques are evaluated by taking into account:
1. efficiency: the number of websites actually scraped out of the total, and execution performance;
2. effectiveness: the completeness and richness of the collected text, which can influence the quality of prediction.

A toy computation of the efficiency figures is sketched below.
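As a small illustration of how the efficiency figures combine, this snippet derives the reach rate and an approximate throughput from the Nutch row of the table that follows; the class itself is hypothetical.

```java
// Toy computation of efficiency measures from one scraping run.
// The input numbers are taken from the Nutch row of the efficiency table.
public class EfficiencyReport {
    public static void main(String[] args) {
        int reached = 7020, attempted = 8550;
        double hoursSpent = 32.5, avgPagesPerSite = 15.2;

        double reachRate    = 100.0 * reached / attempted;   // 82.1%
        double pagesScraped = reached * avgPagesPerSite;     // ~106,700 pages
        double pagesPerHour = pagesScraped / hoursSpent;     // throughput

        System.out.printf("reach rate: %.1f%%, pages/hour: %.0f%n",
                          reachRate, pagesPerHour);
    }
}
```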
Web scraping techniques evaluation: efficiency
Solution | # websites reached | Avg. webpages per site | Time spent | Type of storage | Storage dimensions
Nutch    | 7020/8550 = 82.1%  | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data) + 5.6 GB (index)
HTTrack  | 7710/8550 = 90.2%  | 43.5 | 6.7 days   | HTML files on file system | 16.1 GB
JSOUP    | 7835/8550 = 91.6%  | 68   | 11 hours   | HTML in ADaMSoft compressed binary files | 500 MB
Web scraping techniques evaluation: effectiveness
The evaluation of the effectiveness of the different solutions is based on applying the steps of text and data mining to the collected data in order to predict a subset of the target information of the survey. A simplified sketch of such a classifier is given below.

The developed application is available on the ADaMSoft website: http://adamsoft.sourceforge.net/appscripts.html
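The slides do not show the mining code itself; as a highly simplified stand-in, here is a self-contained multinomial Naïve Bayes classifier of the kind applied in the next slide, trained on scraped-site texts labelled with a yes/no survey answer. Vocabulary handling, feature selection and the actual ADaMSoft scripts are far richer than this.

```java
import java.util.*;

// Minimal multinomial Naive Bayes with Laplace smoothing, two classes (no=0, yes=1).
public class NaiveBayesSketch {
    private final Map<String, int[]> wordCounts = new HashMap<>(); // word -> count per class
    private final int[] classTotals = new int[2];                  // total words per class
    private final int[] docCounts   = new int[2];                  // documents per class

    public void train(String text, int label) {
        docCounts[label]++;
        for (String w : text.toLowerCase().split("\\W+")) {
            wordCounts.computeIfAbsent(w, k -> new int[2])[label]++;
            classTotals[label]++;
        }
    }

    public int predict(String text) {
        double[] logP = new double[2];
        int vocab = wordCounts.size();
        for (int c = 0; c < 2; c++) {
            // log prior with add-one smoothing
            logP[c] = Math.log((docCounts[c] + 1.0) / (docCounts[0] + docCounts[1] + 2.0));
            for (String w : text.toLowerCase().split("\\W+")) {
                int count = wordCounts.containsKey(w) ? wordCounts.get(w)[c] : 0;
                logP[c] += Math.log((count + 1.0) / (classTotals[c] + vocab)); // smoothed likelihood
            }
        }
        return logP[1] > logP[0] ? 1 : 0;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("buy online shop cart checkout order", 1); // site with web sales functionality
        nb.train("company history contacts about us", 0);   // site without
        System.out.println(nb.predict("order online from our shop")); // expected: 1
    }
}
```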
Prediction of survey information by text and data mining
Application of Naïve Bayes to predict all questions in section B8 (question B8: "indicate if the Website has any of the following facilities"). Performance of Naïve Bayes:

Facility | Precision | Sensitivity | Specificity | Observed proportion | Predicted proportion
a) Online ordering or reservation or booking (web sales functionality) | 0.78 | 0.50 | 0.86 | 0.21 | 0.21
b) Tracking or status of orders placed | 0.82 | 0.49 | 0.85 | 0.18 | 0.11
c) Description of goods or services, price lists | 0.62 | 0.44 | 0.79 | 0.48 | 0.32
d) Personalized content in the website for regular/repeated visitors | 0.74 | 0.41 | 0.78 | 0.09 | 0.23
e) Possibility for visitors to customize or design online goods or services | 0.86 | 0.53 | 0.87 | 0.05 | 0.14
f) A privacy policy statement, a privacy seal or a website safety certificate | 0.59 | 0.57 | 0.64 | 0.68 | 0.51
g) Advertisement of open job positions or online job application | 0.69 | 0.52 | 0.78 | 0.35 | 0.33
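The quality measures in the table are read here with their standard meanings (the slide does not spell them out), together with the usual Naïve Bayes decision rule:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
\[
\hat{y} \;=\; \arg\max_{c \,\in\, \{\text{yes},\,\text{no}\}} \; P(c) \prod_{i} P(w_i \mid c)
\]
```

where TP, FP, TN and FN are the true/false positive/negative counts for each question, and the w_i are the terms extracted from a site's scraped text.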
Web scraping: from sample to whole population

So far, the three different solutions for web scraping have been applied to a limited number of websites (those of the enterprises that responded to the sampling survey and declared having a website: about 8,600).

The next step is to scrape all the websites owned by the enterprises included in the population of interest (212,000). This raises two problems:
1. URLs retrieval: how to identify all the websites owned by the 212,000 enterprises (between 90,000 and 100,000 are expected to own one website);
2. massive scraping: how to maintain efficiency when scaling by a factor of 10, from O(10^4) to O(10^5) websites.
Web scraping: URLs retrieval
General idea: for each enterprise:
1. query search engines with the enterprise denomination;
2. process the first ten URLs retrieved in order to choose the right one for the given enterprise.

Processing:
a) matching of the enterprise information (denomination, fiscal code, etc., available from administrative data) against the content of the first ten URLs retrieved;
b) use of the subset of enterprises (from survey data) for which the correct URL is known as a training set, in order to maximise the precision of the choice function;
c) application of the choice function to the whole set.

The final scores are used to rank the retrieved URLs and select the most probable owned URL, as in the sketch below.
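A hypothetical sketch of steps (a)-(c): each of the ten candidate URLs is scored by matching enterprise identifiers against the page content, and the best-scoring candidate is chosen. The fixed weights are invented for illustration; the actual choice function is learned from the training set described in (b).

```java
import org.jsoup.Jsoup;
import java.util.*;

// Score candidate URLs against known enterprise identifiers and pick the best.
public class UrlChooser {
    static double score(String url, String denomination, String fiscalCode) {
        try {
            String text = Jsoup.connect(url).timeout(10_000).get().text().toLowerCase();
            double s = 0.0;
            if (text.contains(denomination.toLowerCase())) s += 1.0; // name match
            if (text.contains(fiscalCode)) s += 2.0;  // fiscal code is near-unique evidence
            return s;
        } catch (Exception e) {
            return -1.0; // unreachable candidate: discard
        }
    }

    static Optional<String> choose(List<String> candidates, String denomination, String fiscalCode) {
        String best = null;
        double bestScore = 0.0; // require at least one positive match
        for (String url : candidates) {
            double s = score(url, denomination, fiscalCode);
            if (s > bestScore) { bestScore = s; best = url; }
        }
        return Optional.ofNullable(best);
    }
}
```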
Web scraping: mass scraping

Use of Nutch on top of MapReduce/Hadoop to harness parallelism (a toy illustration of parallel fetching follows below).

Completed tasks:
• enhancement of Nutch with the following plugins:
  - HTML plugin (Nutch custom search) to retrieve HTML tags;
  - metatag plugin (urlmeta) to add custom metatag information;
• integration of Nutch with the analysis activities in order to execute the whole process.

Future task: deployment and execution of ADaMSoft/JSOUP and Nutch (HTTrack has been abandoned due to its scalability problems) on the CINECA PICO platform (1,080 cores, 54 nodes, 6.9 TB RAM): http://www.cineca.it/en/news/pico-cineca-new-platform-data-analytics-applications
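The production deployment relies on Nutch over Hadoop/MapReduce, which is not reproduced here; as a toy stand-in conveying the same idea (many sites fetched concurrently), the sketch below parallelizes jsoup fetches with a local thread pool. The site list and pool size are placeholders.

```java
import java.util.List;
import java.util.concurrent.*;
import org.jsoup.Jsoup;

// Concurrent fetching of many sites: a single-machine analogue of the
// parallelism that Nutch obtains from MapReduce/Hadoop.
public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> sites = List.of("https://example.com", "https://example.org"); // placeholders
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String url : sites) {
            pool.submit(() -> {
                try {
                    String title = Jsoup.connect(url).timeout(10_000).get().title();
                    System.out.println(url + " -> " + title);
                } catch (Exception e) {
                    System.err.println(url + " failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```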
Conclusions

1. A first remark is that a scraping task can be carried out for different purposes in an Official Statistics production environment, and choosing a single tool for all purposes may not always be possible.
2. As for this specific case, the final evaluation of the different solutions will depend on the results of their execution for massive scraping on an adequate platform (PICO).
3. Finally, we highlight that the scraping application presented here is a sort of “generalized” scraping task, as it does not require any specific assumption about the structure of the websites. In this sense it goes a step further with respect to previous experiences.