Session 6A - Big data sources: web scraping and smart meters
Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies
Giulio Barcaroli (*) ([email protected]), Monica Scannapieco (*) ([email protected]), Marco Scarnò (**) ([email protected]), Donato Summa (*) ([email protected])
(*) Istituto Nazionale di Statistica (Istat)
(**) Consorzio Interuniversitario per il Calcolo Automatico (CINECA)
NTTS 2015
Web scraping definition and types

Web scraping is the process of automatically collecting information from the World Wide Web by means of tools (called scrapers, internet robots, crawlers, spiders, etc.) that navigate websites, extract their content and store the scraped data in local databases for subsequent processing.
We can distinguish two different kinds of web scraping:
1. specific web scraping, where both the structure and the content of the websites to be scraped are known in advance, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical areas of application: data collection for consumer price indices (ONS, CBS, Istat). A minimal sketch of this case follows below;
2. generic web scraping, where no a priori knowledge of the content is available: the whole website is scraped and subsequently processed in order to infer the information of interest.
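To make the distinction concrete, here is a minimal sketch of "specific" scraping in Java using jsoup (the parsing library discussed later in this talk). The target URL, the user-agent string and the CSS selectors are hypothetical; a real price-index scraper would know them in advance from the structure of the monitored shop.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// "Specific" scraping: the page structure is known in advance, so the
// elements of interest can be addressed directly with CSS selectors.
public class SpecificScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://shop.example.com/product/42") // hypothetical URL
                            .userAgent("PriceIndexBot/0.1")                 // hypothetical identifier
                            .timeout(10_000)
                            .get();
        // Hypothetical selectors: derived from the known layout of the target page.
        String name  = doc.select("h1.product-name").text();
        String price = doc.select("span.price").text();
        System.out.println(name + " -> " + price);
    }
}
```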
An application to the «ICT in enterprises» survey
Web scraping different techniques and tools

Different solutions for web scraping are being investigated, based on the use of:
(i) the Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching of results. Nutch is a highly extensible and scalable open source web crawler; it provides parsing, indexing, the building of a search engine that can be customized according to needs, scalability, robustness, and scoring filters for custom implementations;
(ii) HTTrack (http://www.httrack.com/), a free and open source software tool that allows a website to be “mirrored” locally by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;
(iii) JSOUP (http://jsoup.org), a library that parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net), the latter selected because it already includes facilities for handling huge data sets and textual information. A sketch of this parsing step follows below.
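As an illustration of the "generic" case, the following sketch shows what the JSOUP-based step essentially does: download a page, keep the raw HTML for storage, extract the visible text for later mining, and collect out-links for further crawling. The target URL and file name are placeholders; the actual integration lives inside ADaMSoft.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.nio.file.*;

// "Generic" scraping: no assumption on page structure; keep everything
// and defer the extraction of meaning to the text-mining phase.
public class GenericScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.example.com").get(); // hypothetical target

        // Store the raw page for later processing (ADaMSoft keeps these in
        // compressed binary files; here we simply write the HTML to disk).
        Files.writeString(Path.of("page.html"), doc.outerHtml());

        // Extract the full visible text: this is the input to text mining.
        String text = doc.body().text();
        System.out.println(text.substring(0, Math.min(200, text.length())));

        // Collect absolute out-links, which a crawler would enqueue next.
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("abs:href"));
        }
    }
}
```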
Web scraping solutions evaluation

These techniques are evaluated by taking into account:
1. efficiency: the number of websites actually scraped out of the total, and execution performance;
2. effectiveness: the completeness and richness of the collected text, which can influence the quality of prediction.

A toy computation of the efficiency figures is sketched below.
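As a small illustration of how the efficiency figures combine, this snippet derives the reach rate and an approximate throughput from the Nutch row of the table that follows; the class itself is hypothetical.

```java
// Toy computation of efficiency measures from one scraping run.
// The input numbers are taken from the Nutch row of the efficiency table.
public class EfficiencyReport {
    public static void main(String[] args) {
        int reached = 7020, attempted = 8550;
        double hoursSpent = 32.5, avgPagesPerSite = 15.2;

        double reachRate    = 100.0 * reached / attempted;   // 82.1%
        double pagesScraped = reached * avgPagesPerSite;     // ~106,700 pages
        double pagesPerHour = pagesScraped / hoursSpent;     // throughput

        System.out.printf("reach rate: %.1f%%, pages/hour: %.0f%n",
                          reachRate, pagesPerHour);
    }
}
```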
Web scraping techniques evaluation: efficiency
Solution | # websites reached | Avg. webpages per site | Time spent | Type of storage | Storage dimensions
Nutch    | 7020/8550 = 82.1%  | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data) + 5.6 GB (index)
HTTrack  | 7710/8550 = 90.2%  | 43.5 | 6.7 days   | HTML files on file system | 16.1 GB
JSOUP    | 7835/8550 = 91.6%  | 68   | 11 hours   | HTML in ADaMSoft compressed binary files | 500 MB
Web scraping techniques evaluation: effectiveness
The evaluation of the effectiveness of the different solutions is based on applying the steps of text and data mining to the collected data in order to predict a subset of the target information of the survey. A simplified sketch of such a classifier is given below.

The developed application is available on the ADaMSoft website: http://adamsoft.sourceforge.net/appscripts.html
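The slides do not show the mining code itself; as a highly simplified stand-in, here is a self-contained multinomial Naïve Bayes classifier of the kind applied in the next slide, trained on scraped-site texts labelled with a yes/no survey answer. Vocabulary handling, feature selection and the actual ADaMSoft scripts are far richer than this.

```java
import java.util.*;

// Minimal multinomial Naive Bayes with Laplace smoothing, two classes (no=0, yes=1).
public class NaiveBayesSketch {
    private final Map<String, int[]> wordCounts = new HashMap<>(); // word -> count per class
    private final int[] classTotals = new int[2];                  // total words per class
    private final int[] docCounts   = new int[2];                  // documents per class

    public void train(String text, int label) {
        docCounts[label]++;
        for (String w : text.toLowerCase().split("\\W+")) {
            wordCounts.computeIfAbsent(w, k -> new int[2])[label]++;
            classTotals[label]++;
        }
    }

    public int predict(String text) {
        double[] logP = new double[2];
        int vocab = wordCounts.size();
        for (int c = 0; c < 2; c++) {
            // log prior with add-one smoothing
            logP[c] = Math.log((docCounts[c] + 1.0) / (docCounts[0] + docCounts[1] + 2.0));
            for (String w : text.toLowerCase().split("\\W+")) {
                int count = wordCounts.containsKey(w) ? wordCounts.get(w)[c] : 0;
                logP[c] += Math.log((count + 1.0) / (classTotals[c] + vocab)); // smoothed likelihood
            }
        }
        return logP[1] > logP[0] ? 1 : 0;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("buy online shop cart checkout order", 1); // site with web sales functionality
        nb.train("company history contacts about us", 0);   // site without
        System.out.println(nb.predict("order online from our shop")); // expected: 1
    }
}
```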
Prediction of survey information by text and data mining
Application of Naïve Bayes to predict all questions in section B8 (question B8: "indicate if the Website has any of the following facilities"). Performance of Naïve Bayes:

Facility | Precision | Sensitivity | Specificity | Observed proportion | Predicted proportion
a) Online ordering or reservation or booking (web sales functionality) | 0.78 | 0.50 | 0.86 | 0.21 | 0.21
b) Tracking or status of orders placed | 0.82 | 0.49 | 0.85 | 0.18 | 0.11
c) Description of goods or services, price lists | 0.62 | 0.44 | 0.79 | 0.48 | 0.32
d) Personalized content in the website for regular/repeated visitors | 0.74 | 0.41 | 0.78 | 0.09 | 0.23
e) Possibility for visitors to customize or design online goods or services | 0.86 | 0.53 | 0.87 | 0.05 | 0.14
f) A privacy policy statement, a privacy seal or a website safety certificate | 0.59 | 0.57 | 0.64 | 0.68 | 0.51
g) Advertisement of open job positions or online job application | 0.69 | 0.52 | 0.78 | 0.35 | 0.33
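The quality measures in the table are read here with their standard meanings (the slide does not spell them out), together with the usual Naïve Bayes decision rule:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
\[
\hat{y} \;=\; \arg\max_{c \,\in\, \{\text{yes},\,\text{no}\}} \; P(c) \prod_{i} P(w_i \mid c)
\]
```

where TP, FP, TN and FN are the true/false positive/negative counts for each question, and the w_i are the terms extracted from a site's scraped text.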
Web scraping: from sample to whole population

So far, the three different solutions for web scraping have been applied to a limited number of websites (those of the enterprises that responded to the sampling survey and declared having a website: about 8,600).

The next step is to scrape all the websites owned by the enterprises included in the population of interest (212,000). This raises two problems:
1. URLs retrieval: how to identify all the websites owned by the 212,000 enterprises (between 90,000 and 100,000 are expected to own one website);
2. massive scraping: how to maintain efficiency when scaling by a factor of 10, from O(10^4) to O(10^5) websites.
Web scraping: URLs retrieval
General idea: for each enterprise:
1. query search engines with the enterprise denomination;
2. process the first ten URLs retrieved in order to choose the right one for the given enterprise.

Processing:
a) matching of the enterprise information (denomination, fiscal code, etc., available from administrative data) against the content of the first ten URLs retrieved;
b) use of the subset of enterprises (from survey data) for which the correct URL is known as a training set, in order to maximise the precision of the choice function;
c) application of the choice function to the whole set.

The final scores are used to rank the retrieved URLs and select the most probable owned URL, as in the sketch below.
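A hypothetical sketch of steps (a)-(c): each of the ten candidate URLs is scored by matching enterprise identifiers against the page content, and the best-scoring candidate is chosen. The fixed weights are invented for illustration; the actual choice function is learned from the training set described in (b).

```java
import org.jsoup.Jsoup;
import java.util.*;

// Score candidate URLs against known enterprise identifiers and pick the best.
public class UrlChooser {
    static double score(String url, String denomination, String fiscalCode) {
        try {
            String text = Jsoup.connect(url).timeout(10_000).get().text().toLowerCase();
            double s = 0.0;
            if (text.contains(denomination.toLowerCase())) s += 1.0; // name match
            if (text.contains(fiscalCode)) s += 2.0;  // fiscal code is near-unique evidence
            return s;
        } catch (Exception e) {
            return -1.0; // unreachable candidate: discard
        }
    }

    static Optional<String> choose(List<String> candidates, String denomination, String fiscalCode) {
        String best = null;
        double bestScore = 0.0; // require at least one positive match
        for (String url : candidates) {
            double s = score(url, denomination, fiscalCode);
            if (s > bestScore) { bestScore = s; best = url; }
        }
        return Optional.ofNullable(best);
    }
}
```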
Web scraping: mass scraping

Use of Nutch on top of MapReduce/Hadoop to harness parallelism (a toy illustration of parallel fetching follows below).

Completed tasks:
• enhancement of Nutch with the following plugins:
  - HTML plugin (Nutch custom search) to retrieve HTML tags;
  - metatag plugin (urlmeta) to add custom metatag information;
• integration of Nutch with the analysis activities in order to execute the whole process.

Future task: deployment and execution of ADaMSoft/JSOUP and Nutch (HTTrack has been abandoned due to its scalability problems) on the CINECA PICO platform (1,080 cores, 54 nodes, 6.9 TB RAM): http://www.cineca.it/en/news/pico-cineca-new-platform-data-analytics-applications
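The production deployment relies on Nutch over Hadoop/MapReduce, which is not reproduced here; as a toy stand-in conveying the same idea (many sites fetched concurrently), the sketch below parallelizes jsoup fetches with a local thread pool. The site list and pool size are placeholders.

```java
import java.util.List;
import java.util.concurrent.*;
import org.jsoup.Jsoup;

// Concurrent fetching of many sites: a single-machine analogue of the
// parallelism that Nutch obtains from MapReduce/Hadoop.
public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> sites = List.of("https://example.com", "https://example.org"); // placeholders
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String url : sites) {
            pool.submit(() -> {
                try {
                    String title = Jsoup.connect(url).timeout(10_000).get().title();
                    System.out.println(url + " -> " + title);
                } catch (Exception e) {
                    System.err.println(url + " failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```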
Conclusions

1. A first remark is that a scraping task can be carried out for different purposes in an Official Statistics production environment, and choosing a single tool for all purposes may not always be possible.
2. As for this specific case, the final evaluation of the different solutions will depend on the results of their execution for massive scraping on an adequate platform (PICO).
3. Finally, we highlight that the scraping application presented here is a sort of “generalized” scraping task, as it does not require any specific assumption about the structure of the websites. In this sense it goes a step further with respect to previous experiences.