Big Data ESSNet WP1: Web Scraping for Job Vacancy Statistics · 2018-05-22 · • Direct web...
Big Data ESSNet WP1:
Web Scraping for Job Vacancy Statistics
Nigel Swier
Today’s talk is just the tip of the iceberg…
Potential of On-line Job Vacancy (OJV) Data
Current Official Estimates (Survey) vs Online data:
• Frequency: Quarterly (survey); Real-time? (online)
• Industry Sector
• Enterprise Size
• Job type / skills
• Geography
• National Totals
More frequent · More timely · More granular · Less burden · Cheaper???
The Partners
SGA-1 partners (from Feb 2016):
• UK (lead)
• Germany
• Slovenia
• Greece
• Italy
• Sweden
SGA-2 partners (from Aug 2017):
• Belgium
• France
• Portugal
The People
Wiesbaden, April 2016
Rome, November 2016
Thessaloniki, Sept 2017
Milan, March 2018
Six challenges with using On-line Job Vacancy (OJV) data for statistical purposes
Challenge 1:
Not all jobs are advertised on-line. Coverage is incomplete and not representative.
Recruitment by Channels, Germany 2016 (Source: JVS)
Challenge 2:
There is no definitive source of OJV data
• National Employment Agencies
• Job portals:
• Job Boards
• Job Search Engines
• Hybrid Portals
• Enterprise websites
• Data aggregators:
• Commercial providers
• CEDEFOP
Duplication
Challenge 3:
Much OJV data is unstructured. Text processing and analysis is required to extract useful information.
Challenge 4:
Some job ads are not within the scope of official statistics definitions of a job vacancy
• International Jobs
• Ghost Vacancies
• Unpaid Student Internships
Challenge 5:
The official definition of a job vacancy does not correspond directly to the concept of a live job ad
One ad, multiple vacancies
Challenge 6:
The specific job vacancy data landscape varies between countries:
• Size of country and number of job portals
• Digital penetration
• Characteristics of the economy and the labour market
• The role of National Employment Agencies
• Differences in the Job Vacancy Survey
• Language(s)
• Legal Issues
Summary of Challenges
OJV data is not representative of the labour market and there are definitional issues that make it difficult to compare directly with official statistics
Data Access
OJV Data Landscape
Diagram elements:
• Job Boards
• Private Employment Agencies
• Employers
• Job Search Engines
• National Employment Agency
• Enterprise Websites
• Data Aggregators
• Public Policy
• Cedefop
• Official Job Vacancy Statistics
Approaches to Data Access
• Direct web scraping
• Point and click
• Programmatic (e.g. Python Scrapy)
• Web-scraping enterprise websites
• Agreed Access
• National employment agency
• Private job portals
• Commercial providers
• CEDEFOP
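The slide names Python Scrapy as one example of programmatic scraping. As a rough, dependency-free illustration of the same idea, the sketch below uses only the Python standard library to pull job titles and locations out of a listing page. The HTML snippet, the class names (`job`, `title`, `location`) and the `JobAdParser` helper are all hypothetical; a real portal has its own markup, and its terms of use and robots.txt would need checking before any scraping.

```python
from html.parser import HTMLParser

# Hypothetical listing page; real job portals use their own markup.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Data Analyst</span>
      <span class="location">Newport</span></li>
  <li class="job"><span class="title">Staff Nurse</span>
      <span class="location">Leeds</span></li>
</ul>
"""


class JobAdParser(HTMLParser):
    """Collects one dict per <li class="job"> element."""

    FIELDS = {"title", "location"}

    def __init__(self):
        super().__init__()
        self.ads = []        # parsed ads, in page order
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "job":
            self.ads.append({})
        elif tag == "span" and css_class in self.FIELDS and self.ads:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the fields of interest.
        if self._field is not None:
            self.ads[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_job_ads(html):
    parser = JobAdParser()
    parser.feed(html)
    return parser.ads


ads = parse_job_ads(SAMPLE_HTML)
```

In practice a framework such as Scrapy would add crawling, throttling and retry logic around this extraction step; the sketch covers only the parsing.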
Data Access by Country
Data Handling
• Data cleaning and deduplication
• Text analysis and classification
• Flow to stock transformation
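Two of the steps above, deduplication and the flow to stock transformation, can be sketched in a few lines. This is only an illustration under simple assumptions, not the method used by any partner: ads are treated as duplicates when their normalised title and employer match exactly, and the stock on a reference date is the number of distinct ads live on that date. The records, field layout and function names are hypothetical. Since one ad can cover multiple vacancies, this counts live ads, not vacancies.

```python
from datetime import date

# Hypothetical scraped ads: (source, title, employer, first_seen, last_seen).
ADS = [
    ("board_a", "Data Analyst", "Acme Ltd", date(2018, 1, 10), date(2018, 2, 20)),
    ("board_b", "data analyst ", "ACME LTD", date(2018, 1, 12), date(2018, 2, 25)),
    ("board_a", "Staff Nurse", "City Hospital", date(2018, 2, 1), date(2018, 2, 10)),
    ("board_b", "Welder", "Steelworks", date(2018, 3, 1), date(2018, 3, 30)),
]


def dedup_key(title, employer):
    """Normalise case and whitespace so near-identical ads share a key."""
    return (" ".join(title.lower().split()), " ".join(employer.lower().split()))


def deduplicate(ads):
    """Merge ads with the same key, keeping the widest observed live period."""
    merged = {}
    for _source, title, employer, first_seen, last_seen in ads:
        key = dedup_key(title, employer)
        if key in merged:
            start, end = merged[key]
            merged[key] = (min(start, first_seen), max(end, last_seen))
        else:
            merged[key] = (first_seen, last_seen)
    return merged


def stock_on(merged, reference_date):
    """Count distinct ads that were live on the reference date."""
    return sum(start <= reference_date <= end for start, end in merged.values())


unique_ads = deduplicate(ADS)
stock_mid_feb = stock_on(unique_ads, date(2018, 2, 15))
```

Exact-key matching is deliberately crude: it would merge a genuinely re-posted ad with the original and miss duplicates whose wording differs slightly.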
Classifying textual data with machine learning
Can industry and occupation be classified from a job ad?
Occupation is fairly straightforward in this case.
Industry is more difficult. This company is an employment agency, not the employer. But there are clues…
Text pre-processing and feature extraction
• Text Standardisation
• Stop word removal
• White/blacklists
• Stemming (e.g. “making” => “mak”)
• Lemmatization:
• Standard (e.g. “making” => “make”)
• Sophisticated:
• Feature Extraction:
• Bag of words / n-grams
• Term frequency
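Several of the steps listed above can be sketched with the standard library alone: text standardisation, stop word removal, bag of words with n-grams, and term frequency. The stop-word list, sample text and function names are illustrative assumptions. Stemming and lemmatization are left out because they depend on language-specific resources, and the ASCII-only tokeniser would need extending for most of the languages covered by the project.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be language-specific.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Standardise text: lower-case it and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return consecutive n-token sequences joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, max_n=2):
    """Bag of words plus n-grams up to max_n, with raw term counts."""
    tokens = remove_stop_words(tokenize(text))
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


ad_text = "Experienced Data Analyst wanted for the data team in Newport"
features = term_frequencies(ad_text)
```

Note that bigrams are built after stop words are removed, so words that were not adjacent in the original text can end up paired; whether that is desirable depends on the classifier.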
Machine Learning
• Training data
• Libraries:
• scikit-learn
• RTextTools
• Best performing algorithms/approaches
• SVM with Linear Kernel (Portugal)
• Logistic Regression (France)
• Multinomial Naïve Bayes (Germany)
• Ensemble (Belgium)
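To make one of the listed approaches concrete, here is a minimal from-scratch sketch of a Multinomial Naïve Bayes text classifier with add-one smoothing. It is not the implementation used in the German study, which the slides do not describe; the partners report using libraries such as scikit-learn and RTextTools. The training ads, labels and class name are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled training ads: (text, occupation label).
TRAINING = [
    ("registered nurse hospital ward care", "health"),
    ("care assistant nursing home", "health"),
    ("python developer software team", "ict"),
    ("software engineer java backend", "ict"),
]


def tokens(text):
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for word in tokens(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text):
        """Unnormalised log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokens(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


model = MultinomialNB().fit(TRAINING)
prediction = model.predict("senior software developer")
```

With real data the input would be the pre-processed features rather than raw whitespace-split text, and performance would be checked on held-out ads.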
Results: Classifying Occupation
Occupation Coding Confusion Matrix, Portugal Study
Results: Classifying Industry
NACE Coding Confusion Matrix, Belgium Study
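For readers unfamiliar with confusion matrices, the sketch below shows how one is tabulated from true and predicted codes. The labels are made up for illustration and are not results from the Portuguese or Belgian studies; the function names are likewise assumptions.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) where matrix[i][j] counts items whose true
    class is labels[i] and whose predicted class is labels[j]."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(matrix):
    """Share of items on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical codes, not results from any of the country studies.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
labels, matrix = confusion_matrix(y_true, y_pred)
```

Off-diagonal cells show which classes are confused with which, something a single accuracy figure hides.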
Other approaches to classifying data
• String matching
• Levenshtein distance
• Jaccard Similarity
• Phrase-based classification (PBC)
• Controlled vocabularies
• More precision
• Greater transparency
• Less Scalable
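The two string-matching measures named above are simple to compute. The sketch below gives a plain dynamic-programming Levenshtein distance and a word-level Jaccard similarity; choosing words rather than characters or n-grams as the set elements is an assumption made for illustration, and the example strings are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (0.0 to 1.0)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


distance = levenshtein("kitten", "sitting")
similarity = jaccard("Acme Ltd", "acme limited")
```

Levenshtein distance suits near-identical strings such as misspelt job titles, while Jaccard similarity is more tolerant of reordered or missing words, for example when matching employer names.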
Methodology
• Quality Assessment Frameworks
• Assessing Coverage
• Matching and Linking
• Time series analysis / Nowcasting
Assessment against aggregates
Assessment against statistical units
Also illustrates an LSTM neural network nowcasting model that uses multiple OJV sources
JV count comparison for a selected company, UK Study
Time Series Analysis
Statistical Outputs
Experimental Outputs For Slovenia
Job Vacancy Flash Estimates
Job Vacancies by Local Areas
Key Conclusions (and Questions)
• Agreed access arrangements are generally better than direct web scraping
• OJV data cannot replace the Job Vacancy Survey
• OJV data does not correspond to target concepts and only measures part of the labour market. How useful are these measures?
• If useful, how should these measures be presented alongside the official estimates?
• A successful collaboration with CEDEFOP is essential. How do we get the best possible quality data for official statistics purposes?
Future Perspectives
Disruptive technologies
Drivers of Cedefop RLMI work
Complement skills intelligence toolkit
Better labour market information for better policies
Lack of comparable data and systematic analysis
Key characteristics of the project
• Based on previous feasibility study
– Interesting and unique set of results
– Data used for Eurostat hackathon
– Data used for various activities of WP 1
• Key features
– Preselected, well-analysed sources
– All 28 EU MS / all EU official languages
– Skills in ESCO v.1 + other attributes
• Time horizon
– Early release (Dec. 2018)
– CZ, DE, ES, FR, IT, IE, UK
– Final version (Dec 2020)
Connect to ESSNet and Eurostat
• Valuable two-way cooperation
– Big Data Task Force
– EU hackathon
– Data4policy Sherpa Meeting
– ESSNet WP1
• What next?
– Validation
– Production