Big Data for Labour Market Information · 2019. 11. 18. · Big Data for Labour Market Information...

Big Data for Labour Market Information Session 7 Architecture: solutions for real-time LMI (based on KDD) Alessandro Vaccarino Fabio Mercorio Big Data for Labour Market Information focus on data from online job vacancies training workshop Milan, 21-22 November 2019


Page 1: Big Data for Labour Market Information

Session 7

Architecture: solutions for real-time LMI

(based on KDD)

Alessandro Vaccarino – Fabio Mercorio

Big Data for Labour Market Information – focus on data from online job vacancies – training workshop

Milan, 21-22 November 2019

Page 2: Topics

1. Goal & context
2. Challenges

1. The functional architecture
2. Why use micro-services
3. The team and the pipeline design
4. How to handle infrastructure costs

Page 5: Challenges

• Handle a huge amount of near-real-time data
• Data comes from the web: need to detect and reduce noise
• Multi-language environment
• Need to relate to classification standards
• Find a way to summarize and present a wide and complex scenario

Page 7: Topics

1. Goal & context
2. Challenges

1. Stakeholders
2. The functional architecture
3. Why use micro-services
4. The team and the pipeline design

Page 8: Stakeholders

• Project Leader
• Key Users
• Domain Experts
• End Users

Page 9: Project leader

• Lead the project with the steering committee
• Define the scope of the project
• Define key organizations
• Maintain relations with stakeholders
• Provide advice

Page 10: Key Users

• Define requirements
• Monitor the quality of the project
• Provide input to the development of the project
• Manage the source landscaping
• Validate the overall data flow and methodology

Page 11: Domain Experts

• International Country Experts
• Provide the knowledge and expertise
• Execute the landscaping
• Understand the language/terms of their context
• Evaluate the accuracy of the results
• Test the product
• Provide feedback

Page 12: End Users

• Decision Makers and Business Users
  o Explore datasets, analyses and aggregate data (visually)
  o Define new analysis processes
  o Produce data storytelling
  o Make decisions by exploring data
• Data Scientists
  o Apply new machine learning models and AI techniques
  o Extract new insights from the data
  o Apply advanced data modelling to the dataset
• Data Analysts
  o Interpret data and turn it into information
  o Identify patterns and trends
  o Extract and analyze aggregate data
  o Publish and share their analysis

Page 13: Topics

1. Goal & context
2. Challenges

1. Stakeholders
2. The functional architecture
3. Why use micro-services
4. The team and the pipeline design

Page 14: Overall Data Flow

• Stages: Data Ingestion → Pre-Processing → Information Extraction → ETL → Presentation Area
• Phases: Ingestion, Processing, Front end
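
The five stages named on this slide can be read as a chain of steps, each consuming the output of the previous one. The following is only a minimal sketch under that assumption: the stage bodies, the sample postings, the `known_cities` set and the `run_pipeline` helper are all hypothetical placeholders, not the implementation presented in the session.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def data_ingestion(_: List[Record]) -> List[Record]:
    # Stand-in for scraping/crawling: returns two hypothetical raw postings.
    return [{"raw": "  Data Analyst wanted in Milan  "}, {"raw": ""}]


def pre_processing(records: List[Record]) -> List[Record]:
    # Drop empty postings and normalise whitespace.
    cleaned = []
    for rec in records:
        text = " ".join(str(rec["raw"]).split())
        if text:
            cleaned.append({**rec, "text": text})
    return cleaned


def information_extraction(records: List[Record]) -> List[Record]:
    # Toy rule: tag a location when a known (hypothetical) city name occurs.
    known_cities = {"Milan", "Turin"}
    extracted = []
    for rec in records:
        text = str(rec["text"])
        city = next((c for c in sorted(known_cities) if c in text), None)
        extracted.append({**rec, "location": city})
    return extracted


def etl(records: List[Record]) -> List[Record]:
    # Keep only the fields the presentation layer needs.
    return [{"text": r["text"], "location": r["location"]} for r in records]


def presentation_area(records: List[Record]) -> List[Record]:
    # Stand-in for a dashboard: return rows ready for display.
    return records


PIPELINE: List[Stage] = [
    data_ingestion,
    pre_processing,
    information_extraction,
    etl,
    presentation_area,
]


def run_pipeline(stages: List[Stage]) -> List[Record]:
    # Feed the output of each stage into the next one.
    records: List[Record] = []
    for stage in stages:
        records = stage(records)
    return records


result = run_pipeline(PIPELINE)
```

Keeping every stage behind the same signature is one way to let a stage be replaced or scaled without touching its neighbours.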

Page 15: Conceptual architecture

• Areas: Data ingestion, Data processing, Data analysis
• Components: Monitor and scheduler, Crawler, Scraper, Direct access, Data quality, Data processing and classification, ETL, Visual interface, Data lab, Data supply, Backup

Page 16: Logical view

Diagram labels, grouped by role:

• Sources: Employment Agencies and Public Employment Services; Job Portals; Newspaper; Companies; University Job Placement; Classified Ads Sites
• Ingestion: Web Scraper; Web Crawler; Direct Access
• Processing: Pre-Processing; Information Extraction and Classification; Data Management and Presentation
• Storage: Document store; DW
• Output: Job Vacancies classified on ISCO; recognised NUTS; other dimensions (contract, sector, education, …)
• Analysis: Labour Market Analysts; Interactive Data Analytics

Page 17: Physical view

• Data Ingestion
• Data Processing
• Modelling, Machine Learning, AI
• Data visualization
• Data storage & archiving
• System and process monitoring
• Automation & management
• Input / Output: Unstructured Data, Dashboard and interactive report, Machine to machine, Web App

Page 18: Technology view

• Modelling, Machine Learning, AI
• Data Ingestion
• Data Processing
• Data visualization
• Data storage & archiving
• System and process monitoring
• Automation & management
• Input / Output: Unstructured Data, Dashboard and interactive report, Machine to machine, Web App

Page 19: Key design projects

- Micro-services
- Componentization
- Component specialization
- Small applications
- Portability
- Reuse
- Maintenance
- Scale Out
- Performance

Page 20: Key components

• Data ingestion: collect raw data from online job vacancies (OJV) in both structured and unstructured (raw text) formats
• Data processing: classify data through machine learning techniques
• Data analysis: extract information from data and make it available through visualization
• Backup: store data in a safe environment to allow warm and cold restore
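
To make the "Data processing" component more concrete, here is a toy sketch of text classification. It swaps in a simple technique chosen for illustration, TF-IDF weighting with nearest-neighbour cosine similarity, and does not claim to be the model used in the project. The training examples and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical labelled examples (vacancy title -> occupation label).
TRAINING: List[Tuple[str, str]] = [
    ("java software developer backend", "software developer"),
    ("python developer web applications", "software developer"),
    ("registered nurse hospital ward", "nurse"),
    ("nurse for elderly care home", "nurse"),
]


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def idf_weights(docs: List[List[str]]) -> Dict[str, float]:
    # Smoothed inverse document frequency: rarer terms weigh more.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    # Terms unseen at training time are ignored.
    tf = Counter(t for t in tokens if t in idf)
    return {term: freq * idf[term] for term, freq in tf.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


IDF = idf_weights([tokenize(text) for text, _ in TRAINING])
VECTORS = [(tfidf(tokenize(text), IDF), label) for text, label in TRAINING]


def classify(text: str) -> Optional[str]:
    # Return the label of the most similar training example,
    # or None when the text shares no known term with any example.
    query = tfidf(tokenize(text), IDF)
    score, label = max(
        ((cosine(query, vec), lab) for vec, lab in VECTORS),
        key=lambda pair: pair[0],
    )
    return label if score > 0 else None
```

For instance, `classify("senior python developer")` falls on the software developer examples, while a title with no known term yields `None` rather than a guess.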

Page 21: Infrastructure Challenges at a glance

• Parallel ingestion
• High performance
• High memory
• Storage
• Scalable

Page 22: Big Data Flow

• Infrastructure challenges
• Components by definition
• Quality requirements
• Micro-services design

Page 23: Topics

1. Goal & context
2. Challenges

1. Stakeholders
2. The functional architecture
3. Why use micro-services
4. The team and the pipeline design

Page 24: Microservices

Page 25: Context

• Maintainability
• Monitoring
• Scalability
• Updates
• Onboarding

Page 26: Pre-Processing Microservices

• Language Detector
• Spam Filter
• Deduplication component
• N-gram component
• Tokenizer
• Stemmer
• No-Vacancy Filter
• Text Cleaner
• Merge Vacancy
• TF-IDF Transformer
• Document2Vec
• StopWords Removers
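
Several of the pre-processing components listed on this slide can be illustrated as small, single-purpose functions. The sketch below covers a text cleaner, a tokenizer, a stop-word remover, an n-gram builder and an exact-match deduplication step. All function names, the tiny stop-word list and the hashing strategy are assumptions made for illustration; the real components are likely language-aware and more tolerant of near-duplicates.

```python
import hashlib
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "and", "for", "in", "of", "the", "to", "we"}


def clean_text(raw: str) -> str:
    # Text Cleaner: strip HTML-like tags and collapse whitespace.
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()


def tokenize(text: str) -> List[str]:
    # Tokenizer: lower-case alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    # StopWords Remover: drop very frequent function words.
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    # N-gram component: sliding windows of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fingerprint(tokens: List[str]) -> str:
    # Deduplication key: hash of the normalised token sequence.
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def deduplicate(postings: Iterable[str]) -> List[List[str]]:
    # Keep the first posting for each fingerprint, in input order.
    seen: Set[str] = set()
    unique: List[List[str]] = []
    for raw in postings:
        tokens = remove_stop_words(tokenize(clean_text(raw)))
        key = fingerprint(tokens)
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique


sample = [
    "<p>We are looking for a Data Analyst</p>",
    "we are looking for a   data analyst",
    "Nurse in Milan",
]
unique_postings = deduplicate(sample)
```

Here the first two postings differ only in markup, casing and spacing, so they collapse to a single entry after normalisation.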

Page 27: Classification Microservices

• Skills Classifier
• Occupation Classifier
• Education Requirements Classifier
• Industry Classifier
• WorkingHours Detector
• Contract Detector
• Locations Detector
• Dates Extractor
• Salary Extractor
• Experience Extractor
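
Detectors and extractors such as the ones named on this slide can be approximated with rules. The sketch below shows a rule-based contract detector, salary-range extractor and experience extractor. The keyword lists, the regular expressions and the assumption that dots and commas inside amounts are thousands separators are all hypothetical simplifications, and the slides do not say whether the actual services are rule-based or learned.

```python
import re
from typing import Optional, Tuple

# Hypothetical keyword lists; real detectors would need per-language resources.
CONTRACT_KEYWORDS = {
    "permanent": ("permanent", "open-ended"),
    "temporary": ("temporary", "fixed-term"),
    "internship": ("internship", "traineeship"),
}

# Matches ranges such as "EUR 30,000 - 35,000" or "€30.000 to €35.000".
SALARY_RANGE = re.compile(
    r"(?:€|EUR)\s*(\d[\d.,]*)\s*(?:-|to)\s*(?:€|EUR)?\s*(\d[\d.,]*)",
    re.IGNORECASE,
)

# Matches "3 years", "5+ years", "1 year".
EXPERIENCE = re.compile(r"(\d+)\+?\s*years?", re.IGNORECASE)


def _to_int(amount: str) -> int:
    # Assumption: dots and commas are thousands separators, not decimals.
    return int(re.sub(r"[.,]", "", amount))


def detect_contract(text: str) -> Optional[str]:
    lowered = text.lower()
    for label, keywords in CONTRACT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def extract_salary_range(text: str) -> Optional[Tuple[int, int]]:
    match = SALARY_RANGE.search(text)
    if match is None:
        return None
    return _to_int(match.group(1)), _to_int(match.group(2))


def extract_experience_years(text: str) -> Optional[int]:
    match = EXPERIENCE.search(text)
    return int(match.group(1)) if match else None


posting = (
    "Permanent contract, salary EUR 30,000 - 35,000, "
    "at least 3 years of experience."
)
contract = detect_contract(posting)
salary = extract_salary_range(posting)
experience = extract_experience_years(posting)
```

Each function returns `None` when nothing is found, so downstream steps can distinguish "not stated in the vacancy" from an extracted value.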

Page 28: Technology requirements

1. Services on request
2. Network access
3. Resource pooling
   1. Governance
4. Quick elasticity
5. Measurement of services
   1. Data Quality
   2. Performance
6. Portability (on-premises and different cloud services)
7. Polyglot
   1. Computer programming languages
   2. Technologies

Page 29: Topics

1. Goal & context
2. Challenges

1. Stakeholders
2. The functional architecture
3. Why use micro-services
4. The team and the pipeline design

Page 30: The team

1. Cloud Architects
2. Software Architects and Developers
3. Big Data Engineers
4. Data Scientists
5. Domain & Ontology Experts

Page 31: Organizations

• Diagram labels: Cloud, Infrastructure, Service, Team, Components, Micro-service, Service, Execution
• Steps: Define, Design, Develop, Deploy

Page 32: Organize around business services

• Language Detector
• Occupation Classifier
• Salary Extractor
• Skills Classifier
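
Organising around business services suggests that each capability sits behind its own small interface and can be called on its own. The sketch below shows one way this might look, with in-process handlers that exchange JSON messages. The registry, the service names, the `handle_request` helper and the toy heuristics are hypothetical; a real deployment would expose each service over the network and deploy it separately.

```python
import json
from typing import Callable, Dict

Service = Callable[[dict], dict]
REGISTRY: Dict[str, Service] = {}


def service(name: str) -> Callable[[Service], Service]:
    # Register a handler under a business-oriented service name.
    def decorator(func: Service) -> Service:
        REGISTRY[name] = func
        return func
    return decorator


@service("language-detector")
def detect_language(payload: dict) -> dict:
    # Toy heuristic on a few hypothetical marker words; not a real detector.
    words = set(payload["text"].lower().split())
    language = "it" if words & {"il", "di", "per", "cerchiamo"} else "en"
    return {"language": language}


@service("occupation-classifier")
def classify_occupation(payload: dict) -> dict:
    # Toy keyword rule standing in for a trained classifier.
    text = payload["text"].lower()
    found = "analyst" in text or "analista" in text
    return {"occupation": "data analyst" if found else "unknown"}


def handle_request(service_name: str, body: str) -> str:
    # JSON in, JSON out: callers depend only on the message format,
    # not on how a given service is implemented.
    handler = REGISTRY.get(service_name)
    if handler is None:
        return json.dumps({"error": f"unknown service: {service_name}"})
    return json.dumps(handler(json.loads(body)))


request_body = json.dumps({"text": "Cerchiamo un analista di dati"})
language_reply = json.loads(handle_request("language-detector", request_body))
occupation_reply = json.loads(handle_request("occupation-classifier", request_body))
```

Because callers see only a service name and a JSON message, a handler could be rewritten in another language or moved to another host without changing its clients, which fits the polyglot and portability requirements listed earlier.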