WP1 - Web Scraping for Job
Vacancy Statistics
Big Data ESSNet Workshop, Sofia, 24-25 February 2017
Nigel Swier
Rationale
Current Official
Estimates (Survey)
Online data
Frequency Quarterly Real-time?
Industry Sector
Enterprise Size
Job type / skills
Sub-national
National Totals
More frequent More timely More granular Less burden Cheaper???
Participants (SGA-1)
• United Kingdom (lead)
• Germany
• Sweden
• Slovenia
• Italy
• Greece
Data Access
1. Web scraping Job Portals
2. Job Portal APIs
3. Web scraping Enterprise Websites
4. Public Sector Agencies
5. Commercial Suppliers
Broad Approach
• Understand the landscape of web-based job vacancy
data in each country
• Focus first on job portals, later explore enterprise
websites
• Try to replicate existing outputs, then investigate
opportunities to produce new types of output
• Develop specific approaches that are appropriate to
the circumstances in each country
• Develop common approaches where possible
Key Concepts
Target Measure:
Job Ad
Target Concept:
Job Vacancy
Key Concepts
Job Ad Job Vacancy
“Ghost” Vacancy
Target Population: All job vacancies
Coverage Issues
Advertised on enterprise website
Advertised on a job portal
‘Ghost’
Vacancies
Employing business
is identifiable
Advertised through
an agency
Outline Approach to Data Integration
Sources:
• Counts from online sources: Enterprises A, B, C, D, E
• Survey Estimates: Enterprises A, B, C, F, G
• Business Register: Enterprises A to J
Steps: Matching; Scaling Factors (by NACE?)
Integrated data set: Enterprises A to J
Treatment by case:
1. Survey and online: scale online data to survey estimates
2. Online only: apply scaling factors to online data
3. Survey only: use survey estimates
4. Neither survey nor online: modelled estimates
Total = Survey Estimate
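The four cases in the integration outline can be illustrated with a minimal sketch. It is not the project's implementation: the `Enterprise` fields, the ratio-based scaling factor per NACE group, and the `modelled_default` placeholder for case 4 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enterprise:
    name: str
    nace: str                          # industry group used for scaling (assumed)
    online_count: Optional[float]      # job ads found online, None if not observed
    survey_estimate: Optional[float]   # survey vacancies, None if not in the sample


def scaling_factors(enterprises):
    """Case 1: ratio of survey vacancies to online ads per NACE group,
    using only enterprises observed in both sources."""
    survey_sum, online_sum = {}, {}
    for e in enterprises:
        if e.online_count is not None and e.survey_estimate is not None:
            survey_sum[e.nace] = survey_sum.get(e.nace, 0.0) + e.survey_estimate
            online_sum[e.nace] = online_sum.get(e.nace, 0.0) + e.online_count
    return {n: survey_sum[n] / online_sum[n] for n in survey_sum if online_sum[n] > 0}


def integrate(enterprises, modelled_default=0.0):
    """Return one vacancy figure per enterprise in the business register."""
    factors = scaling_factors(enterprises)
    result = {}
    for e in enterprises:
        if e.survey_estimate is not None:
            # Cases 1 and 3: the survey estimate is used as it stands.
            result[e.name] = e.survey_estimate
        elif e.online_count is not None and e.nace in factors:
            # Case 2: online only, scaled by the factor of its NACE group.
            result[e.name] = e.online_count * factors[e.nace]
        else:
            # Case 4 (or no factor available): placeholder for a modelled estimate.
            result[e.name] = modelled_default
    return result


register = [
    Enterprise("A", "C", online_count=10, survey_estimate=20),
    Enterprise("D", "C", online_count=5, survey_estimate=None),
    Enterprise("F", "C", online_count=None, survey_estimate=7),
    Enterprise("H", "C", online_count=None, survey_estimate=None),
]
integrated = integrate(register)
```

In this sketch a real modelled estimate would replace `modelled_default`, and the result would still need calibrating so that the total equals the survey estimate.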
Conclusion
• Data from on-line job ads are very rich, but
complex and unstructured
• Difficult to align to established statistical
concepts
• Need to understand coverage issues and
how to tackle them
• Surveys will still be needed and so the
challenges are around integrating different
sources.
Dissemination Workshop Sofia. 22-23 February 2017
Hellenic Statistical Authority
Christina Pierrakou – Eleni Bisioti
ELSTAT
• Web Scraped Data Structure
• Tools and Environment
• Web scraping experiment
• Matching Results
Web scraping Tools
Job Portal Ads
Import.io, Content Grabber
Pre-processed Data
Processing – Deduplication
Data Analysis
Activities of head offices; management consultancy activities: 20%
Employment activities: 14%
Manufacture of food products: 10%
Telecommunications: 10%
Education: 9%
Wholesale trade, except of motor vehicles and motorcycles: 6%
Accommodation: 5%
Human health activities: 4%
Advertising and market research: 3%
Office administrative, office support and other business support activities: 3%
Others: 16%
1 Managers: 5%
2 Professionals: 16%
3 Technicians and Associate Professionals: 7%
4 Clerical Support Workers: 12%
5 Service and Sales Workers: 49%
6 Skilled Agricultural, Forestry and Fishery Workers: 0.1%
7 Craft and Related Trades Workers: 5%
8 Plant and Machine Operators and Assemblers: 1%
9 Elementary Occupations: 5%
WP1-Webscraping job vacancies.
SURS experiment
Boro Nikic
ESSnet Big Data Dissemination
Workshop. Sofia
23-24. 2. 2017
Current Survey on JV (1)
• EU regulation: Number of JV ads broken down by activity (B-S) and
size (10+ employees)
• Population: Legal units with at least 1 employee
– 61,544 legal units (excluding the public sector)
• Sample
– 8,942 legal units (probability sample)
– plus approx. 3,300 legal units from the public sector
– 12,200 units, 20% of the population
Stratum   Size class                     Number of units   Rate (%)
0         1-2 persons employed           2,095             23.4
1         3-9 persons employed           3,570             39.9
2         10-49 persons employed         2,065             23.1
3         50-249 persons employed        1,033             11.6
4         250 or more persons employed   179               2.0
Total                                    8,942             100.0
Slovenian Job Portals
There are around 30 job portals in Slovenia. The two most
important ones cover more than 95% of JV ads.
Since May 2016, data have been collected weekly from those two
portals.
Structure of the scraped data
Position                                 Enterprise                              Location   Date
Pizzopek m/ž                             Trummer osebni servis d.o.o.            Maribor    Objavljeno: 15.04.2016
Vodja kuhinje m/ž                        Trummer osebni servis d.o.o.            Maribor    Objavljeno: 15.04.2016
Knjigovodja m/ž                          SPORTINA Bled d.o.o.                    Lesce      Objavljeno: 15.04.2016
Asistent vodji produktov m/ž v Mariboru  Trenkwalder kadrovske storitve d.o.o.   Maribor    Objavljeno: 15.04.2016
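A scraped record with these four fields could be turned into a typed row along the following lines. This is only a sketch: the `ScrapedAd` class and `parse_row` function are hypothetical names, and it assumes the date always arrives as "Objavljeno: DD.MM.YYYY" ("Objavljeno" is Slovenian for "Published").

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class ScrapedAd:
    position: str
    enterprise: str
    location: str
    published: date


def parse_row(position, enterprise, location, date_text):
    """Build a ScrapedAd from the four raw text fields of one scraped record."""
    # Drop the "Objavljeno:" prefix if present, keep only the date part.
    raw = date_text.split(":", 1)[-1].strip()
    published = datetime.strptime(raw, "%d.%m.%Y").date()
    return ScrapedAd(position.strip(), enterprise.strip(), location.strip(), published)


ad = parse_row("Knjigovodja m/ž", "SPORTINA Bled d.o.o.", "Lesce", "Objavljeno: 15.04.2016")
```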
Record linkage with BR
1 Merging by unique (short, complete, abbreviated) name of enterprise
2 Merging by unique (short, complete) name of enterprise, with the location of the enterprise removed from the name
4-6 Merging by non-unique (short, complete, abbreviated) name of enterprise and location of the work/enterprise
7-8 Record linkage using a distance function (short, complete name of enterprise)
10 Manual (agencies, bigger enterprises)
11 Record linkage using a distance function (complete name of enterprise)
0 Unmerged
Step    31.5.            31.8.
        N      %         N      %
1.1     1401   72.03     1271   75.61
1.2     365    18.77     250    14.87
1.3     3      0.15      2      0.12
2.1     22     1.13      24     1.43
2.2     9      0.46      13     0.77
4       11     0.57      15     0.89
7       16     0.82      27     1.61
8       11     0.57      9      0.54
10      82     4.22      49     2.91
11      17     0.87      2      0.12
0       8      0.41      19     1.13
TOTAL   1945   100       1681   100
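The exact-name and distance-function steps could be sketched as follows. This is an illustration under assumptions, not the SURS procedure: the register ids, the list of legal-form suffixes, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the distance function are all choices made here.

```python
import difflib
import re

# Legal-form suffixes stripped before comparison (assumed list).
LEGAL_FORMS = re.compile(r"\b(d\.o\.o\.|d\.d\.|s\.p\.)")


def normalise(name):
    """Lower-case, drop legal forms and punctuation, collapse whitespace."""
    name = LEGAL_FORMS.sub(" ", name.lower())
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(name.split())


def link(scraped_name, register, threshold=0.85):
    """register maps a business-register id to an enterprise name.
    Returns (id or None, method used)."""
    target = normalise(scraped_name)
    exact = [rid for rid, name in register.items() if normalise(name) == target]
    if len(exact) == 1:
        return exact[0], "exact"
    if len(exact) > 1:
        # Non-unique name: the location would be needed to decide.
        return None, "ambiguous"
    best_id, best_score = None, 0.0
    for rid, name in register.items():
        score = difflib.SequenceMatcher(None, target, normalise(name)).ratio()
        if score > best_score:
            best_id, best_score = rid, score
    if best_score >= threshold:
        return best_id, "distance"
    return None, "unmerged"


# Hypothetical register ids, names taken from the scraped examples.
register = {
    "1001": "SPORTINA Bled d.o.o.",
    "1002": "Trenkwalder kadrovske storitve d.o.o.",
}
```

Ambiguous and unmerged records would then go to the location-based and manual steps.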
Duplicates
Key for merging: name of enterprise, job title, location
Counter   31.5.           31.8.
          N     %         N     %
0         817             818
0.1       518   63.40     538   65.77
0.2       241   29.50     221   27.02
0.3       63    7.71      59    7.21
1         563             356
2         285             254
3         42              47
4         5               2
0 - number of distinct enterprises
0.1 - number of enterprises which advertise only on MojeDelo
0.2 - number of enterprises which advertise only on MojaZaposlitev
0.3 - number of enterprises which advertise on both job portals
1 - number of enterprises with unique ads on MojeDelo
2 - number of enterprises with unique ads on MojaZaposlitev
3 - number of enterprises with more than one ad on MojaZaposlitev
and MojeDelo
4 - number of enterprises with more than one ad on MojaZaposlitev and
MojeDelo, where the number of ads on the two portals does not match
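Deduplication on the merging key (name of enterprise, job title, location) could look roughly like this. It is a minimal sketch: the dictionary field names and the light normalisation (lower-casing, collapsing whitespace) are assumptions, not the method used in the experiment.

```python
from collections import defaultdict

KEY_FIELDS = ("enterprise", "title", "location")


def merge_key(ad):
    """Key for merging: enterprise name, job title and location, lightly normalised."""
    return tuple(" ".join(ad[f].lower().split()) for f in KEY_FIELDS)


def deduplicate(ads):
    """Keep the first ad per key and record on which portals it was seen."""
    portals = defaultdict(set)
    first_seen = {}
    for ad in ads:
        key = merge_key(ad)
        portals[key].add(ad["portal"])
        first_seen.setdefault(key, ad)
    return [dict(first_seen[k], portals=sorted(portals[k])) for k in first_seen]


ads = [
    {"enterprise": "Trummer osebni servis d.o.o.", "title": "Vodja kuhinje m/ž",
     "location": "Maribor", "portal": "MojeDelo"},
    {"enterprise": "TRUMMER osebni servis  d.o.o.", "title": "Vodja kuhinje m/ž",
     "location": "maribor", "portal": "MojaZaposlitev"},
    {"enterprise": "SPORTINA Bled d.o.o.", "title": "Knjigovodja m/ž",
     "location": "Lesce", "portal": "MojeDelo"},
]
unique_ads = deduplicate(ads)
```

The `portals` list on each retained ad makes it possible to count enterprises advertising on one portal only or on both.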
Weekly movement of the number of JV ads
[Chart: y-axis from 0 to 3000; series: Total, Total (cleaned), Moje delo, Moje delo (cleaned), Moja zaposlitev, Moja zaposlitev (cleaned)]
IT tools involved in scraping job portals
SCRAPING
OUTPUT
FILE
STORAGE
STATISTICAL
PRODUCTION
SAS Contextual Analytics, Data Scraping Studio
Methodology of scraping of enterprise websites (1)
• Identify URL links of enterprises
• Identify sublinks which potentially contain JV ads
• Employing machine learning techniques, detect the JV ads from the list of contents of sublinks
• Detect variables (location, job title, skills...)
Not implemented yet
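The sublink identification step could be approximated with a simple keyword filter over the links on an enterprise's home page. This is a sketch only: the keyword list is an assumption, and it does not attempt the machine learning step, which would run on the pages these links point to.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Substrings suggesting a careers page (assumed; Slovenian and English).
KEYWORDS = ("zaposlitev", "kariera", "career", "jobs", "vacanc")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_sublinks(base_url, html):
    """Return absolute URLs whose href or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in KEYWORDS):
            hits.append(urljoin(base_url, href))
    return hits


page = ('<a href="/kariera">Kariera</a><a href="/about">About us</a>'
        '<a href="jobs/list.html">Open positions</a>')
found = candidate_sublinks("https://example.com/", page)
```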
Coverage: Sample vs. scraped & admin data
                   Reported data   Scraped & admin data   Job portals   Enterprise websites
Number of JV ads   4312            2321                   1073          262
Percentage         100%            54%                    25%           6%
Strata             Questionnaire   BD Sources   Percentage
1 employee         67              16           24%
1-9 employees      470             173          37%
10-49 employees    923             362          39%
50-249 employees   1681            744          44%
250 employees      1119            782          70%
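The percentages by stratum are the number of JV ads found in the big data (BD) sources divided by the number reported in the questionnaire. A short check of that arithmetic, using the figures from the slide (the variable and function names are illustrative):

```python
# (questionnaire, BD sources) per stratum, as given on the slide
strata = {
    "1 employee": (67, 16),
    "1-9 employees": (470, 173),
    "10-49 employees": (923, 362),
    "50-249 employees": (1681, 744),
    "250 employees": (1119, 782),
}


def coverage_pct(questionnaire, bd_sources):
    """Share of reported JV ads that were also found in BD sources, in whole percent."""
    return round(100 * bd_sources / questionnaire)


coverage = {name: coverage_pct(q, bd) for name, (q, bd) in strata.items()}
```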
Planned activities (1)
Additional question in questionnaire for regular JV survey (2017)
Main goal: collect information about how JV ads are advertised
Side goal: collect the URLs of enterprises
• Job portals
• Enterprise websites
• Employment agencies
• Newspapers
• Social networks (LinkedIn, Facebook...)
Planned activities (2)
December 2016: Meeting with the Employment Service of Slovenia (ESS)
Aim: deeper knowledge about cooperation between enterprises and ESS
March 2017: Meeting with the employment agencies
Aim: cooperation between SURS and the agencies
    Agency                 Number of employees
1   AC d.o.o.              79
2   ADECCO H.R. d.o.o.     3255
3   KARIERA D.O.O.         1296
4   KI INTERIM D.O.O.      167
5   KOROTAJ D.O.O.         330
6   MANPOWER D.O.O.        467
7   PAPIR SERVIS D.O.O.    518
8   TRENKWALDER D.O.O.     1242
Planned activities (3)
Processing of Job portals data in 2017:
1. Weekly movement of the number of JV ads broken down by main economic
activities (and by size groups)
2. Testing the models for grossing up job portal (and other) JV data to the
target population level (auxiliary information from the Statistical register of
employees)
3. Record linkage with the Standard Occupational Classification System
4. Integration of data from job portals, enterprise websites and
administrative sources (internal pilot at SURS).
facebook.com/statistiskacentralbyranscb
@SCB_nyheter statistiska_centralbyran_scb www.linkedin.com/company/scb
Internet job portals as a
source for job vacancy
statistics
Data sources
Swedish Employment Agency
2 236 663 advertisements (January 2012 - June 2016)
Statistics Sweden Job Vacancy Survey
410 393 business records (January 2012 - June 2016)
legal units (public sector)
local units (private sector)
Statistics Sweden Business Register
In progress: contacts with three private job portals
Swedish Employment Agency
Job portal Platsbanken (PB) covers about 40% of the market
Information is entered manually at the Agency, on the web by employers, or submitted by files
Several required variables for advertising on the web (e.g. company name and id, address, occupation title, description and requirements of the job, posting date, etc.)
Rules to avoid invalid values, duplicate advertisements, old advertisements, etc.
Number of days an advertisement is on the web: mean 25 days, median 21 days
Quality of the PB data
Checking invalid values: few invalid records on important identifying variables, dates, and important variables like occupation and type of employment
Coverage:
Recruiting/outsourcing companies: top three companies are behind 3% of the advertisements
Big cities appear frequently (Stockholm, Gothenburg, Malmö, Uppsala)
High skilled jobs frequent (> 40%)
Idea: use the text of the advertisements in PB and the high quality of the structured variables to find a good method for text analysis. Use the method on other portals with lower quality
Matching PB with Business Register
94-99% match on organization id, municipality,
occupation code, NACE
So far: Very difficult to match on company name
For the matched counts:
Number of work places   %
1-10                    61
11-250                  30
251-1000                6
>1000                   2
0/Null                  2
Number of employees     %
0-9                     23
10-49                   17
50-99                   6
100-200                 6
>200                    46
0/Null                  1
Matching PB with Job Vacancy Survey
PB data are first aggregated and grouped
according to the variables organization id,
municipality code, year, and month.
PB: 951 195 rows
Job Vacancy: 410 393 rows
20% of data can be matched
Public sector 70% match
Private sector 16% match
Work in progress
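The aggregation and matching step could be sketched as follows. The field names (`org_id`, `municipality`, `year`, `month`) are hypothetical, and the match rate is computed here as the share of survey records with a counterpart in PB, which is an assumption since the slide does not state the denominator.

```python
from collections import Counter


def aggregate_ads(ads):
    """Count PB ads per (organization id, municipality code, year, month)."""
    return Counter((a["org_id"], a["municipality"], a["year"], a["month"]) for a in ads)


def match_rate(pb_counts, survey_keys):
    """Share of distinct survey keys that also appear in the aggregated PB data."""
    keys = set(survey_keys)
    if not keys:
        return 0.0
    matched = sum(1 for k in keys if k in pb_counts)
    return matched / len(keys)


# Hypothetical records for illustration only.
pb_ads = [
    {"org_id": "A1", "municipality": "0180", "year": 2016, "month": 5},
    {"org_id": "A1", "municipality": "0180", "year": 2016, "month": 5},
    {"org_id": "B2", "municipality": "1480", "year": 2016, "month": 5},
]
pb_counts = aggregate_ads(pb_ads)
survey = [("A1", "0180", 2016, 5), ("C3", "1280", 2016, 5)]
rate = match_rate(pb_counts, survey)
```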
Employers by sector, %
Sector                     PB    Survey   Business Register:   Business Register:
                                          Businesses           Employees
Private                    70    92       90                   68
Non-profit organizations   1.5   1        10                   0.02
Public                     17    7        0.05                 30
Missing                    12    -        -                    -
Three other job portals
Metrojobb
Data sources: Employment agency, manually, files, web scraping
First data through API
CareerBuilder
Data sources: manually, files, through customer systems
Textkernel: “semantic search” (web scraping) Jobfeed (not in Sweden)
Jobbsafari
Planned meeting in Copenhagen in March
Web scraping
Issues:
Validation
Linking
Duplicates
Coverage
Etc…