job vacancies Web scraping · Web scraping job vacancies (ESSnet on Big Data - Work package 1)...

Frantisek (Fero) Hajnovic [email protected] Big data team Web scraping job vacancies (ESSnet on Big Data - Work package 1)

Page 1:

Web scraping job vacancies (ESSnet on Big Data - Work package 1)

Frantisek (Fero) Hajnovic [email protected]

Big data team

Page 2:

Outline

● Sample based scraping

● Full-size scraping

● Company names matching

● Comparisons

● To-do

● Automated scraping framework

● Dashboard

● Scraping company websites

● DEMO!

● DEMO!

Page 9:

Proof of concept

● P.O.C. on random sample of 50 (large) companies

○ Survey vs. Company websites vs. Job portals

Company   Survey   CW     Indeed   ...
Tesco     2345     1351   1525     ...
HSBC      321      243    210      ...
...       ...      ...    ...      ...

Page 10:

Useful quick insights

● Which portal is better?

● “Boots” problem

● Gap: survey - online

Page 11:

Pros and cons

+ Quick and simple scrapers

+ Entries already linked (matched)

+ Lightweight (less risk), at least for a small sample

- Sample bias

- Effort to increase the sample

Page 13:

Full-size scraping

● Directory

● “Proper” spiders

○ Careerjet

○ CV-library

○ Universal Jobmatch

● T&Cs / robots.txt

+ Lots of data

+ Not influenced by sample

- More “risky” scraping

- Need to match
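
The robots.txt part of the check above could be automated before a spider requests a page. This is a minimal sketch using Python's standard `urllib.robotparser`; the sample rules, the user-agent string and the URLs are made up for illustration and are not taken from any of the portals named on the slide. Terms and conditions still need to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real portal.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules above.
    for url in ("https://example.org/jobs/search",
                "https://example.org/private/data"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "vacancy-spider", url))
```

In a real run the robots.txt text would be downloaded from the portal itself rather than hard-coded.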

Page 15:

Matching company names

[Figure: two spellings of the same company, “Milton Keynes Borough Council” and “MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS WITH EXTERNAL PAYROLL PROVIDERS”, both matched to the company “Milton Keynes council”. Table columns: company name, JV count. Rows: Survey, Careerjet. JV counts shown: 25 and 34.]

● Casing, stop-words, (TF-)IDF scores, INCL/EXCL

● 434 entries matched (3.7%)
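
The ingredients listed above (casing, stop-words, IDF scores, INCL/EXCL) can be combined as in the sketch below. This is not the project's matching code: the stop-word list, the smoothed IDF formula, the cosine scoring and every function name are assumptions chosen for illustration. Only the two Milton Keynes names come from the slide; the other candidate names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which words were dropped.
STOP_WORDS = {"ltd", "limited", "plc", "the", "of", "and"}

# First name is from the slide, the other two are made-up register entries.
CANDIDATES = [
    "MILTON KEYNES COUNCIL INCL EDUCATION EXCL SCHOOLS WITH EXTERNAL PAYROLL PROVIDERS",
    "TESCO STORES LTD",
    "HSBC BANK PLC",
]


def normalise(name):
    """Cut the name at the first INCL/EXCL marker, lower-case it, drop stop-words."""
    head = re.split(r"\b(?:INCL|EXCL)\b", name, maxsplit=1, flags=re.IGNORECASE)[0]
    tokens = re.findall(r"[a-z0-9]+", head.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def idf_weights(token_lists):
    """Smoothed inverse document frequency of each token over all names."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def similarity(a, b, idf):
    """Cosine similarity of two token lists, each distinct token weighted by IDF."""
    sa, sb = set(a), set(b)
    dot = sum(idf.get(t, 1.0) ** 2 for t in sa & sb)
    norm_a = math.sqrt(sum(idf.get(t, 1.0) ** 2 for t in sa))
    norm_b = math.sqrt(sum(idf.get(t, 1.0) ** 2 for t in sb))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, candidates):
    """Return the candidate most similar to query, with its score."""
    tokenised = [normalise(c) for c in candidates]
    q = normalise(query)
    idf = idf_weights(tokenised + [q])
    scores = [similarity(q, t, idf) for t in tokenised]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    print(best_match("Milton Keynes Borough Council", CANDIDATES))
```

A score threshold below which no match is accepted would be needed in practice; the slides do not say which threshold, if any, was used.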

Page 17:

Scraping company websites

● One website, one spider - no problem

● 50 websites

Page 18:

Specific spider code

● Name and rep. unit

● URL

● Extraction

○ XPath

○ Regex pattern
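
One way to read the list above is that each company website needs only a small record (name, reporting unit, URL, XPath, regex) on top of shared spider code. The sketch below illustrates that idea with the standard library only. The `SiteConfig` class, the sample markup and the URL are hypothetical, and `xml.etree.ElementTree` stands in for the XPath engine of a real scraping library: it supports only a subset of XPath and needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SiteConfig:
    """Per-website settings; everything else is shared spider code."""
    name: str
    reporting_unit: str
    url: str
    xpath: str    # where the vacancy count lives in the page
    pattern: str  # regex with one group capturing the number


def extract_count(html, config):
    """Locate the node given by config.xpath and pull the count out with the regex."""
    root = ET.fromstring(html)
    node = root.find(config.xpath)
    if node is None:
        return None
    text = "".join(node.itertext())
    match = re.search(config.pattern, text)
    return int(match.group(1)) if match else None


# Hypothetical company and page, for illustration only.
EXAMPLE_CONFIG = SiteConfig(
    name="Example Care Ltd",
    reporting_unit="0000000",
    url="https://example.org/careers",
    xpath=".//span[@class='count']",
    pattern=r"(\d+)\s+vacancies",
)
EXAMPLE_HTML = (
    "<html><body><div id='results'>"
    "<span class='count'>Showing 27 vacancies</span>"
    "</div></body></html>"
)

if __name__ == "__main__":
    print(EXAMPLE_CONFIG.name, extract_count(EXAMPLE_HTML, EXAMPLE_CONFIG))
```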

Page 19:

Scraping company websites

● Type of access to the relevant HTML

○ Simple HTTP. E.g. Caring homes

○ Selenium. E.g. Care UK

● Obtaining count

○ Direct count. E.g. Caring homes

○ Counting vacancies. E.g. University of Portsmouth

● Pagination

○ Not necessary. E.g. Caring homes

○ Necessary. E.g. Somerset county
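
When a site shows no direct count and spreads its vacancies over several pages, the count has to be accumulated page by page. A minimal sketch of that case follows. The `fetch_page` callable is a hypothetical stand-in for whatever retrieves the HTML (plain HTTP or Selenium), and the canned pages and the item pattern are invented for the example.

```python
import re


def count_vacancies(fetch_page, item_pattern, max_pages=100):
    """Sum vacancy items over result pages until an empty page is reached."""
    total = 0
    for page_number in range(1, max_pages + 1):
        html = fetch_page(page_number)
        found = len(re.findall(item_pattern, html))
        if found == 0:
            break  # no more results, stop paginating
        total += found
    return total


# Hypothetical stand-in for a real fetch: three canned result pages.
FAKE_PAGES = {
    1: '<li class="vacancy">A</li><li class="vacancy">B</li>',
    2: '<li class="vacancy">C</li>',
    3: "",
}


def fake_fetch(page_number):
    """Return canned HTML for a page number, or an empty page."""
    return FAKE_PAGES.get(page_number, "")


if __name__ == "__main__":
    print(count_vacancies(fake_fetch, r'<li class="vacancy">'))
```

The `max_pages` cap is a safety net against sites that keep returning the last page for any page number.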

Page 20:

Scraping company websites

Page 21:

Scraping company websites

Page 24:

Project architecture

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)

○ Sample-based (sample of 50 companies): PSB (portal sample-based) with Indeed, Careerjet, Brick7, ...; CW (comp. websites) (selenium)

○ Full-size: PFS (portal full-size) with Careerjet, CV-library, UJM

● Emailer (mailjet) ⟶ [email protected]

● Tests, scripts, notebooks... (nose, bash, jupyter,...)

Page 25:

Emails from scraping
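
The content of these emails is not preserved in the transcript. As a rough illustration of what a per-run report could look like, the sketch below builds a summary message with Python's standard `email.message`. It does not use the Mailjet API; the `build_report` helper, the result format and the addresses are all assumptions.

```python
from email.message import EmailMessage


def build_report(results, sender, recipient):
    """Build a summary email; results maps spider name -> item count (None = failed)."""
    failed = [name for name, count in results.items() if count is None]
    lines = [
        f"{name}: {'FAILED' if count is None else count}"
        for name, count in sorted(results.items())
    ]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = (
        f"Scraping run: {len(results) - len(failed)} ok, {len(failed)} failed"
    )
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    # Hypothetical spider results and placeholder addresses.
    report = build_report(
        {"careerjet": 1200, "indeed": None},
        "scraper@example.org",
        "team@example.org",
    )
    print(report)
```

Sending the message is left out on purpose, since it depends on the mail provider.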

Page 26:

Deploying project

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)

○ Sample-based (sample of 50 companies): PSB (portal sample-based) with Indeed, Careerjet, Brick7, ...; CW (comp. websites) (selenium)

○ Full-size: PFS (portal full-size) with Careerjet, CV-library, UJM

● Emailer (mailjet) ⟶ [email protected]

● Deploy (bash)

● Tests, scripts, notebooks... (nose, bash, jupyter,...)

Google cloud

● “Managing” instance: Run scraping (Cron-job), 24h

● Mongo DB instance: Turn on/off, Store data

Page 27:

Technologies used

Page 29:

Visualise the data

Python project

● Spiders (scrapy, xpath, regex, beautiful soup)

○ Sample-based (sample of 50 companies): PSB (portal sample-based) with Indeed, Careerjet, Brick7, ...; CW (comp. websites) (selenium)

○ Full-size: PFS (portal full-size) with Careerjet, CV-library, UJM

● Emailer (mailjet) ⟶ [email protected]

● Deploy (bash)

● Dashboard (flask, bokeh, js): Visualise

● Tests, scripts, notebooks... (nose, bash, jupyter,...)

Google cloud

● “Managing” instance: Run scraping (Cron-job), 24h

● Mongo DB instance: Turn on/off, Store data

Page 30:

Dashboard

Page 33:

Scatter plot with best fit
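
The plot itself is not preserved in the transcript. For reference, a best-fit line through paired counts (for example survey count against scraped count per company) can be obtained by ordinary least squares. The sketch below is a generic implementation, not the code behind the slide, and the function name is an assumption.

```python
def best_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept through paired values."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x + 1.
    print(best_fit([1, 2, 3], [3, 5, 7]))
```

A slope well below 1 with survey counts on the x-axis would indicate that the online source systematically shows fewer vacancies than the survey.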

Page 34:

Bland-Altman plot with KDE
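
A Bland-Altman plot shows, for each pair of measurements, their difference against their mean, together with the mean difference (bias) and limits of agreement. The sketch below computes those quantities with the standard library; it uses the common bias ± 1.96 standard deviations convention, which is an assumption about the slide, and it leaves out both the plotting and the KDE.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return per-pair means and differences, the bias, and the limits of agreement."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long series with at least two values")
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return means, diffs, bias, (bias - spread, bias + spread)


if __name__ == "__main__":
    # Made-up counts from two sources for three companies.
    means, diffs, bias, limits = bland_altman([10, 20, 30], [8, 21, 27])
    print(means, diffs, bias, limits)
```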

Page 35:

BA-plots side by side

Page 36:

Krippendorff’s alpha

● inter-rater agreement

● <-1, 1>

○ 1 = perfect agreement

○ 0 = absence of reliability

○ -1 = systematic disagreement

K.A. = 0.755
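
Krippendorff's alpha is defined as 1 - D_o / D_e, the observed disagreement within units divided by the disagreement expected by chance. The sketch below implements it for the interval metric (squared difference) with complete data, treating each source (survey, portal, ...) as a rater and each company as a unit. The choice of the interval metric is an assumption; the slides do not state which metric produced 0.755, and this sketch is not meant to reproduce that figure.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    units: list of lists, one inner list of numeric ratings per unit.
    Units with fewer than two ratings cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two ratings")
    # Observed disagreement: squared differences within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences over all pairs of values.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical, alpha is undefined")
    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Made-up ratings: two raters, three units, close but not identical.
    print(krippendorff_alpha_interval([[10, 11], [20, 19], [30, 30]]))
```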

Page 37:

Comparing portals

Page 39:

Nowcasting

[Figure: timeline. Survey data arrives in steps of 1 month, scraped data in steps of 1 day. Earlier periods are labelled “train”, the latest period “predict”.]

● In total

● Per industry

● Per company

Page 40:

Nowcasting survey entry

● Possible model inputs

○ Scraped values

○ Previous survey values

○ Company parameters

○ Industry (dummy 0-1 coding)

○ Outlying factor

○ …

● Possibly lots of training data

○ 6k entries in survey

○ monthly

[Figure: for company X at time t, the inputs Portal 1, Portal 2, …, Portal n; Comp. website; Survey(t-1), Survey(t-2), …, Survey(t-k); Industry 1, Industry 2, …, Industry m; Employee size; Outlying factor feed a Regression (neural network?) whose output is Survey(t).]
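
The inputs listed above can be flattened into one numeric vector per company and month before being handed to a regression model. The sketch below shows only that step, including the dummy 0-1 coding of industry. The function name, the ordering of the features and all numbers in the usage example are assumptions; the outlying factor is left out because the slides do not define it.

```python
def build_features(portal_counts, website_count, previous_survey,
                   industry, industries, employee_size):
    """Flatten the model inputs for one company at time t into a list of floats.

    portal_counts:   scraped counts from Portal 1..n
    website_count:   count scraped from the company website
    previous_survey: Survey(t-1), Survey(t-2), ..., Survey(t-k)
    industry:        the company's industry label
    industries:      all m industry labels, in a fixed order
    """
    if industry not in industries:
        raise ValueError(f"unknown industry: {industry!r}")
    # Dummy 0-1 coding: a single 1.0 in the position of the company's industry.
    dummies = [1.0 if industry == name else 0.0 for name in industries]
    return (
        [float(c) for c in portal_counts]
        + [float(website_count)]
        + [float(v) for v in previous_survey]
        + dummies
        + [float(employee_size)]
    )


if __name__ == "__main__":
    # Hypothetical values for one company; the target would be Survey(t).
    row = build_features(
        portal_counts=[1525],
        website_count=1351,
        previous_survey=[2300, 2250],
        industry="retail",
        industries=["finance", "retail"],
        employee_size=300000,
    )
    print(row)
```

With 6k survey entries arriving monthly, stacking such rows over companies and months gives the training matrix, and the matching Survey(t) values give the targets.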

Page 41:

Scale up and expand!

● Why not?

○ New FS spider ⟶ 1 - 3 days

○ New SB spider ⟶ 1 - 3 days + sample

○ New CW spider ⟶ 10 minutes

● Sample ⟶ 100

● Improve matching

● Data from partners

Page 42:

The deadly triangle

Page 43:

thanks!

Questions?

[email protected]