Post on 25-Jul-2020


Frantisek (Fero) Hajnovic
frantisek.hajnovic@ons.gov.uk

Big data team

Web scraping job vacancies

(ESSnet on Big Data - Work package 1)

Outline

● Sample based scraping
● Full-size scraping
● Company names matching
● Scraping company websites (DEMO!)
● Automated scraping framework
● Dashboard (DEMO!)
● Comparisons
● To-do

Proof of concept

● P.O.C. on random sample of 50 (large) companies
  ○ Survey vs. Company websites (CW) vs. Job portals

Company   Survey   CW     Indeed   ...
Tesco     2345     1351   1525     ...
HSBC      321      243    210      ...
...       ...      ...    ...      ...

Useful quick insights

● Which portal is better?

● “Boots” problem

● Gap: survey - online

Pros and cons

+ Quick and simple scrapers
+ Entries already linked (matched)
+ Lightweight (less risk), at least for a small sample
- Sample bias
- Effort to increase the sample

Full-size scraping

● Directory
● “Proper” spiders
  ○ Careerjet
  ○ CV-library
  ○ Universal job match
● T&Cs / robots.txt

+ Lots of data
+ Not influenced by sample
- More “risky” scraping
- Need to match
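The T&Cs / robots.txt point can be illustrated with a small check before a full-size spider requests a page. This is only a sketch using Python's standard `urllib.robotparser`; the robots.txt content, the user agent name and the `portal.example` URLs are hypothetical, and reading a portal's T&Cs remains a manual step.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from the portal.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

USER_AGENT = "jv-spider"  # hypothetical user agent name


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Ask whether our user agent may fetch a given URL.
allowed_jobs = parser.can_fetch(USER_AGENT, "https://portal.example/jobs?page=1")
allowed_admin = parser.can_fetch(USER_AGENT, "https://portal.example/admin/stats")

# Crawl-delay (in seconds), if the portal declares one for our user agent.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_jobs, allowed_admin, delay)
```

Scrapy can also enforce this check itself through its `ROBOTSTXT_OBEY` setting.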

Matching company names

[Diagram: two records, one from the Survey and one from Careerjet, each with a company name and a JV count (25 and 34), are matched to the same company, “Milton Keynes council”]

Company names in the example:
“Milton Keynes Borough Council”
“MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS WITH EXTERNAL PAYROLL PROVIDERS”

● Casing, stop-words, (TF-)IDF scores, INCL/EXCL
● 434 entries matched (3.7%)
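One way the ingredients listed above (casing, stop-words, IDF scores, INCL/EXCL) could fit together is sketched below. It is not necessarily the matching used in the project: the stop-word list, the rule of cutting a name at the first INCL/EXCL, the IDF formula and the candidate list are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "ltd", "limited", "plc"}  # assumed list


def tokens(name: str) -> list[str]:
    """Lower-case, cut at the first INCL/EXCL qualifier, drop stop-words."""
    name = re.split(r"\b(?:incl|excl)\b", name.lower())[0]
    return [t for t in re.findall(r"[a-z0-9]+", name) if t not in STOP_WORDS]


def idf_weights(names: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency: rare tokens weigh more."""
    n = len(names)
    df = Counter(t for name in names for t in set(tokens(name)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vector(name: str, idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector of a company name."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens(name)).items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(name: str, candidates: list[str], idf: dict[str, float]) -> str:
    """Return the candidate whose TF-IDF vector is closest to `name`."""
    v = vector(name, idf)
    return max(candidates, key=lambda c: cosine(v, vector(c, idf)))


survey_name = ("MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS "
               "WITH EXTERNAL PAYROLL PROVIDERS")
# Hypothetical list of portal names to match against.
portal_names = ["Milton Keynes Borough Council", "Somerset County Council", "Tesco"]

idf = idf_weights(portal_names + [survey_name])
match = best_match(survey_name, portal_names, idf)
print(match)
```

A real matcher would also need a similarity threshold below which no match is reported, so that unrelated names are not linked.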

Scraping company websites

● One website, one spider - no problem

● 50 websites

Specific spider code

● Name and rep. unit

● URL

● Extraction
  ○ XPath
  ○ Regex pattern
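The per-company spider code described above can be pictured as a small record plus one generic extraction function. The sketch below is an assumption about the shape of such a record, not the project's actual code: the class name `CompanySpec`, its field names, the example company and the URL are hypothetical, and the XPath step is left out so the example needs only the standard library.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompanySpec:
    """Everything that is specific to one company website (hypothetical shape)."""
    name: str      # company name
    rep_unit: str  # reporting unit in the survey (placeholder value below)
    url: str       # page that shows the vacancies
    pattern: str   # regex with one capture group holding the vacancy count


def extract_count(spec: CompanySpec, html: str) -> Optional[int]:
    """Apply the company's regex to the page text; None if nothing matches."""
    match = re.search(spec.pattern, html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


SPEC = CompanySpec(
    name="Example Care Ltd",              # hypothetical company
    rep_unit="0000000",                   # placeholder identifier
    url="https://careers.example/jobs",   # hypothetical URL
    pattern=r"([\d,]+)\s+vacancies found",
)

# In a real spider the HTML would come from an HTTP request to SPEC.url,
# usually narrowed down with an XPath expression before the regex is applied.
html = "<h2>1,351 vacancies found</h2>"
count = extract_count(SPEC, html)
print(SPEC.name, count)
```

Keeping the company-specific part down to a record like this is what would make adding another website a matter of minutes rather than days.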

Scraping company websites

● Type of access to the relevant HTML
  ○ Simple HTTP. E.g. Caring homes
  ○ Selenium. E.g. Care UK
● Obtaining count
  ○ Direct count. E.g. Caring homes
  ○ Counting vacancies. E.g. University of Portsmouth
● Pagination
  ○ Not necessary. E.g. Caring homes
  ○ Necessary. E.g. Somerset county
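The three choices above (type of access, direct count vs. counting vacancies, pagination) can be combined in one generic routine. This is a sketch under assumptions: `fetch_page` is a hypothetical callable that returns the HTML of page n, so it could wrap either a plain HTTP request or a Selenium driver, and the patterns and fake pages are invented for the example.

```python
import re
from typing import Callable, Optional


def count_vacancies(
    fetch_page: Callable[[int], str],
    item_pattern: str,
    direct_pattern: Optional[str] = None,
    paginated: bool = False,
    max_pages: int = 50,
) -> int:
    """Return the vacancy count of one website.

    If `direct_pattern` matches on the first page, the site states its own
    total (direct count). Otherwise individual vacancies are counted with
    `item_pattern`, following further pages when `paginated` is set.
    """
    first = fetch_page(1)
    if direct_pattern is not None:
        match = re.search(direct_pattern, first)
        if match:
            return int(match.group(1))

    total = len(re.findall(item_pattern, first))
    page = 2
    while paginated and page <= max_pages:
        found = len(re.findall(item_pattern, fetch_page(page)))
        if found == 0:  # an empty page marks the end of the listing
            break
        total += found
        page += 1
    return total


# Fake two-page site standing in for real HTTP or Selenium access.
FAKE_SITE = {
    1: '<li class="job">A</li><li class="job">B</li>',
    2: '<li class="job">C</li>',
}

counted = count_vacancies(lambda n: FAKE_SITE.get(n, ""),
                          r'<li class="job">', paginated=True)
direct = count_vacancies(lambda n: "<p>12 jobs</p>",
                         r'<li class="job">', direct_pattern=r"(\d+) jobs")
print(counted, direct)
```

Passing the page fetcher in as a function keeps the counting logic identical for simple HTTP sites and for sites that need a browser.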

DEMO!

Project architecture

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)
  ○ Sample-based (sample of 50 companies)
    PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
    CW (comp. websites) (selenium)
  ○ Full-size
    PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet): emails from scraping ⟶ fhajnovic.ons@gmail.com
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)

Deploying project

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)
  ○ Sample-based (sample of 50 companies)
    PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
    CW (comp. websites) (selenium)
  ○ Full-size
    PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet) ⟶ fhajnovic.ons@gmail.com
● Deploy (bash) ⟶ Google cloud
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)

Google cloud

● “Managing” instance: run scraping (cron job, 24h)
● Mongo DB instance (turn on/off, store data)

Technologies used

Visualise the data

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)
  ○ Sample-based (sample of 50 companies)
    PSB (portal sample-based): Indeed, Careerjet, Brick7, ...
    CW (comp. websites) (selenium)
  ○ Full-size
    PFS (portal full-size): Careerjet, CV-library, UJM
● Emailer (mailjet) ⟶ fhajnovic.ons@gmail.com
● Deploy (bash) ⟶ Google cloud
● Dashboard (flask, bokeh, js) ⟶ visualise
● Tests, scripts, notebooks... (nose, bash, jupyter, ...)

Google cloud

● “Managing” instance: run scraping (cron job, 24h)
● Mongo DB instance (turn on/off, store data)

Dashboard

DEMO!

Scatter plot with best fit

Bland-Altman plot with KDE

BA-plots side by side
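For readers unfamiliar with Bland-Altman plots, the numbers behind one can be computed as in the sketch below: for each company, the mean of the two measurements goes on the x-axis and their difference on the y-axis, with the mean difference (bias) and the conventional bias ± 1.96 SD limits of agreement drawn as horizontal lines. The counts are made-up illustrative values, and the KDE overlay and the plotting itself are omitted.

```python
from statistics import mean, stdev


def bland_altman(a: list[float], b: list[float]):
    """Return plot points and agreement statistics for two paired series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two series of equal length with at least 2 pairs")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x-axis of the plot
    diffs = [x - y for x, y in zip(a, b)]        # y-axis of the plot
    bias = mean(diffs)                           # average disagreement
    spread = 1.96 * stdev(diffs)                 # half-width of the limits
    return means, diffs, bias, (bias - spread, bias + spread)


# Hypothetical vacancy counts for four companies from two sources.
survey = [10, 20, 30, 40]
portal = [8, 21, 27, 44]

means, diffs, bias, (lower, upper) = bland_altman(survey, portal)
print(means, diffs, bias, lower, upper)
```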

Krippendorff’s alpha

● Inter-rater agreement
● Range [-1, 1]
  ○ 1 = perfect agreement
  ○ 0 = absence of reliability
  ○ -1 = systematic disagreement

K.A. = 0.755
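As a reminder of how such a figure is obtained, the sketch below computes Krippendorff's alpha as 1 minus the ratio of observed to expected disagreement. It assumes the interval metric (squared differences), which the slides do not specify; here each source (survey, portal, company website) plays the role of a "rater" and each company is a unit. The function name and the tiny data sets are illustrative only.

```python
from typing import Optional, Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[Optional[float]]]) -> float:
    """Krippendorff's alpha with the interval metric.

    `units` holds one sequence per unit (e.g. per company) with the values
    reported by each rater (e.g. each source); None marks a missing value.
    """
    cleaned = [[float(v) for v in u if v is not None] for u in units]
    cleaned = [u for u in cleaned if len(u) >= 2]  # only pairable units count
    pooled = [v for u in cleaned for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one unit with two values")

    # Observed disagreement: squared differences within each unit.
    observed = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1) for u in cleaned
    ) / n
    # Expected disagreement: squared differences across all pooled values.
    expected = sum((a - b) ** 2 for a in pooled for b in pooled) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all values are identical")
    return 1.0 - observed / expected


perfect = krippendorff_alpha_interval([[1, 1], [5, 5], [9, 9]])
swapped = krippendorff_alpha_interval([[1, 2], [2, 1]])
print(perfect, swapped)
```

With only two units the second example cannot reach -1; it simply shows that systematic disagreement pushes alpha below zero.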

Comparing portals

Nowcasting

[Diagram: timeline with survey data at 1-month intervals and scraped data at 1-day intervals; past periods are marked train, train, train, train and the latest one predict]

● In total

● Per industry

● Per company

Nowcasting survey entry

● Possible model inputs
  ○ Scraped values
  ○ Previous survey values
  ○ Company parameters
  ○ Industry (dummy 0-1 coding)
  ○ Outlying factor
  ○ …
● Possibly lots of training data
  ○ 6k entries in survey
  ○ monthly

For company X at time t

Inputs:
● Portal 1, Portal 2, …, Portal n
● Comp. website
● Survey(t-1), Survey(t-2), …, Survey(t-k)
● Industry 1, Industry 2, …, Industry m
● Employee size
● Outlying factor

⟶ Regression (neural network?) ⟶ Survey(t)
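The inputs listed above could be assembled into one feature vector per company and month, as in the following sketch. The function name, the ordering of the features, the industry list and all numbers are hypothetical; the regression (or neural network) that would map this vector to Survey(t) is not shown.

```python
from typing import Sequence


def build_features(
    portal_counts: Sequence[float],  # Portal 1 .. Portal n at time t
    website_count: float,            # company website count at time t
    survey_lags: Sequence[float],    # Survey(t-1) .. Survey(t-k)
    industry: str,
    industries: Sequence[str],       # fixed list defining the dummy columns
    employee_size: float,
    outlying_factor: float,
) -> list[float]:
    """Concatenate all inputs into one flat vector for a regression model."""
    if industry not in industries:
        raise ValueError(f"unknown industry: {industry}")
    # Dummy 0-1 coding: exactly one column is 1.0 for the company's industry.
    dummies = [1.0 if industry == name else 0.0 for name in industries]
    return [
        *(float(c) for c in portal_counts),
        float(website_count),
        *(float(s) for s in survey_lags),
        *dummies,
        float(employee_size),
        float(outlying_factor),
    ]


INDUSTRIES = ["Retail", "Finance", "Public sector"]  # hypothetical categories

x = build_features(
    portal_counts=[120, 95],
    website_count=110,
    survey_lags=[130, 125],
    industry="Retail",
    industries=INDUSTRIES,
    employee_size=5000,
    outlying_factor=0.0,
)
print(x)
```

Stacking one such vector per company and month, with the observed Survey(t) as the target, gives the training set that the monthly survey of about 6k entries would provide.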

Scale up and expand!

● Why not?

○ New FS spider ⟶ 1 - 3 days

○ New SB spider ⟶ 1 - 3 days + sample

○ New CW spider ⟶ 10 minutes

● Sample ⟶ 100

● Improve matching

● Data from partners

The deadly triangle

thanks!

Questions?

frantisek.hajnovic@ons.gov.uk