
Cedefop DataLAB

Access to data

In order to get data from Athena or Hive we need to have two things:

A Query
A Connection

First let's import the pyathenajdbc and pandas libraries:

In [1]:

from pyathenajdbc import connect
import pandas as pd

With these libraries imported, we can create our connection and run our queries:

In [2]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT *
FROM cedefop_presentation.ft_document_essnet
ORDER BY RAND()
LIMIT 1000;
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

# closing connection
conn.close()

Understanding our data


There are simple methods and attributes in the pandas library which allow us to get to know our data:

shape
head
tail
sample
describe
info

General info

info( ) is a useful method which gives us different information about the data we've just imported:

column names
data type of each column
memory usage of the data
number of non-null values per column

In [3]:

documents.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 49 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(41)
memory usage: 382.9+ KB


Shape of data

To quickly get the number of rows and columns of your data table you can use the shape attribute:

In [4]:

documents.shape

If you are just interested in the number of records, you can use the len( ) function instead:

In [5]:

len(documents)

Data preview

To have a sneak peek at the data you can use head, tail or sample:

Getting the first n rows of the table:

In [6]:

documents.head(10)

# try to use .head( ) without indicating number of rows

Out[4]:

(1000, 49)

Out[5]:

1000


Getting the last n rows of the table:

Out[6]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

0 89833230 17758 2018 8 15 17878

1 148565316 17872 2018 12 7 17932

2 168716345 17914 2019 1 18 18034

3 168990410 17915 2019 1 19 18035

4 86938457 17742 2018 7 30 17784

5 208673305 17959 2019 3 4 18079

6 166831480 17908 2019 1 12 18028

7 119787491 17825 2018 10 21 17945

8 247840311 17976 2019 3 21 18096

9 79270616 17723 2018 7 11 17843

10 rows × 49 columns


In [7]:

documents.tail(3)

Getting a random sample of n rows from table:

In [8]:

documents.sample(4)

# try to use .sample( ) without indicating sample number

Descriptive statistics

Using the describe( ) method you'll get a table with simple statistics for both numerical and categorical features:

Out[7]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

997 177733340 17937 2019 2 10 18057

998 145506827 17871 2018 12 6 17886

999 113561477 17807 2018 10 3 17927

3 rows × 49 columns

Out[8]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

353 178790867 17923 2019 1 27 18043

745 172033162 17923 2019 1 27 18043

650 78356125 17725 2018 7 13 17845

391 171295029 17921 2019 1 25 18041

4 rows × 49 columns


In [9]:

documents.describe(include='all').round(1)

In case you want to get statistics just for the numeric columns, use documents.describe( ).round(1)

Data pre-processing

Almost always, before passing to the processing phase, we need to perform several levels of pre-processing. One of the most important pre-processing tasks is deduplication:

Deduplication

For different reasons we may have duplicated rows in our data. Sometimes these rows are completely identical and sometimes, like in our example, duplicated rows indicate the same job announcement coming from different sources. In this case, in order to identify and eliminate these records, we should use the general_id field:

In [10]:

documents = documents.drop_duplicates(subset=['general_id'])

Out[9]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

count 1000 1000.0 1000.0 1000.0 1000.0 1000.0

unique 1000 NaN NaN NaN NaN NaN

top 166663080 NaN NaN NaN NaN NaN

freq 1 NaN NaN NaN NaN NaN

mean NaN 17847.8 2018.3 7.2 15.4 17953.6

std NaN 72.2 0.5 4.0 8.7 76.5

min NaN 17714.0 2018.0 1.0 1.0 17748.0

25% NaN 17787.0 2018.0 3.0 8.0 17887.0

50% NaN 17852.0 2018.0 8.0 15.0 17960.0

75% NaN 17907.0 2019.0 11.0 23.0 18020.0

max NaN 17986.0 2019.0 12.0 31.0 18106.0

11 rows × 49 columns


The reason we used subset=['general_id'] is that in this example we're not looking for exactly identical rows; any two records with the same general_id are considered duplicates.

Since we only pulled a random sample of 1000 records from Athena, there is no duplicated record in this subset. That's why, if we get the number of records in the deduplicated table, we still have 1000 records:

In [11]:

len(documents)
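Before dropping anything, you can also count the duplicates directly. A minimal sketch (not part of the original notebook), to be run on the freshly loaded sample before drop_duplicates:

# number of rows whose general_id already appeared earlier in the sample
documents.duplicated(subset=['general_id']).sum()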

Data aggregation and manipulation

Filtering data

Filtering data for a specific country:

In [16]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
FROM cedefop_presentation.ft_document_essnet
WHERE country = 'UNITED KINGDOM'
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

In [17]:

print(documents)

Out[11]:

1000

   num_job_vacancy
0         13303636


In [70]:

# query we want to run
query_1 = """
SELECT COUNTRY, COUNT(DISTINCT GENERAL_ID) as num_job_vacancy
FROM cedefop_presentation.ft_document_essnet
GROUP BY COUNTRY
ORDER BY num_job_vacancy desc
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head(20)

Multiple filtering on year and month:

Out[70]:

COUNTRY num_job_vacancy

0 DEUTSCHLAND 15679793

1 UNITED KINGDOM 13303636

2 FRANCE 11959196

3 NEDERLAND 3483875

4 ITALIA 2371035

5 ESPAÑA 1832400

6 BELGIQUE-BELGIË 1802390

7 POLSKA 1444290

8 ÖSTERREICH 1413784

9 SVERIGE 1063498

10 IRELAND 570060

11 ČESKÁ REPUBLIKA 541947

12 LUXEMBOURG 77559


In [20]:

# query we want to run
query_1 = """
SELECT *
FROM cedefop_presentation.ft_document_essnet
WHERE country = 'UNITED KINGDOM' and year_grab_date = 2018 and month_grab_date = 12
limit 100
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents.head()

Filtering on a specific occupation:

Out[20]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

0 146726257 17871 2018 12 6 17991

1 152031314 17876 2018 12 11 17996

2 152222010 17876 2018 12 11 17996

3 152155722 17878 2018 12 13 17998

4 152152423 17878 2018 12 13 17998

5 rows × 49 columns


In [24]:

# query we want to run
query_1 = """
SELECT esco_level_4, count(distinct general_id) as num_ojv
FROM "AwsDataCatalog".cedefop_presentation.ft_document_essnet
WHERE country = 'UNITED KINGDOM' and year_grab_date = 2018
  and esco_level_4 = 'Software developers'
GROUP BY esco_level_4
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

documents

Group by

Top 15 occupations by country in 2018:

In [95]:

# query we want to run
query_1 = """
SELECT *
FROM (
    SELECT idcountry, esco_level_4, num_ojv,
           rank() over (partition by idcountry order by num_ojv desc) as rank
    FROM (
        SELECT idcountry, esco_level_4, count(distinct general_id) as num_ojv
        FROM cedefop_presentation.ft_document_essnet
        WHERE year_grab_date = 2018
        GROUP BY esco_level_4, idcountry
        ORDER BY num_ojv DESC
    ) t1
) t2
where rank <= 15
"""

# reading data using connection and query
top_15_occ_df = pd.read_sql(query_1, conn)
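The rank() window function keeps, for each country, only the 15 most frequent occupations. The same filtering could also be done client-side in pandas; a sketch, assuming a hypothetical counts_df holding the un-ranked idcountry / esco_level_4 / num_ojv counts:

# pandas equivalent of the SQL window function above:
# keep the 15 largest num_ojv values within each idcountry
top_15 = (counts_df.sort_values('num_ojv', ascending=False)
                   .groupby('idcountry')
                   .head(15))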

Out[24]:

esco_level_4 num_ojv

0 Software developers 402245


In [96]:

# reindexing the table (note: set_index returns a new DataFrame, so this call on its own does not change top_15_occ_df)
top_15_occ_df.set_index(["idcountry", "esco_level_4"])

# renaming the columns
top_15_occ_df.columns = ['idcountry', 'occupation', 'count', 'rank']

# rearranging the columns
top_15_occ_df = top_15_occ_df[['idcountry', 'occupation', 'count', 'rank']]

# sort the table
top_15_occ_df = top_15_occ_df.sort_values(['count'], ascending=False)

Let's take a look at the result:

In [97]:

top_15_occ_df[top_15_occ_df['idcountry']=='IT'].head(15)

Top 5 sectors, in 2019, by country:

Out[97]:

idcountry occupation count rank

90 IT Freight handlers 72314 1

91 IT Shop sales assistants 70410 2

92 IT Software developers 49289 3

93 IT Cleaners and helpers in offices, hotels and ot... 48368 4

94 IT Manufacturing labourers not elsewhere classified 41240 5

95 IT Administrative and executive secretaries 38120 6

96 IT Draughtspersons 32772 7

97 IT Commercial sales representatives 29679 8

98 IT Assemblers not elsewhere classified 28718 9

99 IT Metal working machine tool setters and operators 27659 10

100 IT Systems analysts 26714 11

101 IT Accounting and bookkeeping clerks 25259 12

102 IT Advertising and marketing professionals 24595 13

103 IT Electrical mechanics and fitters 24527 14

104 IT Retail and wholesale trade managers 22844 15


In [106]:

# query we want to run
query_1 = """
SELECT *
FROM (
    SELECT idcountry, macro_sector, num_ojv,
           rank() over (partition by idcountry order by num_ojv desc) as rank
    FROM (
        SELECT idcountry, macro_sector, count(distinct general_id) as num_ojv
        FROM cedefop_presentation.ft_document_essnet
        WHERE year_grab_date = 2019
        GROUP BY macro_sector, idcountry
        ORDER BY num_ojv DESC
    ) t1
) t2
where rank <= 15
"""

# reading data using connection and query
top_15_sect_df = pd.read_sql(query_1, conn)

In [107]:

# reindexing the table
top_15_sect_df.set_index(["idcountry", "macro_sector"])

# renaming the columns
top_15_sect_df.columns = ['idcountry', 'macro_sector', 'count', 'rank']

# rearranging the columns
top_15_sect_df = top_15_sect_df[['idcountry', 'macro_sector', 'count', 'rank']]

# sort the table
top_15_sect_df = top_15_sect_df.sort_values(['count'], ascending=False)

In [108]:

top_15_sect_df[top_15_sect_df['idcountry']=='UK'].head(5)

Top 5 occupations by country and sector:

Out[108]:

idcountry macro_sector count rank

180 UK Administrative and support service activities 767198 1

181 UK Professional, scientific and technical activit... 710293 2

182 UK Human health and social work activities 458694 3

183 UK Information and communication 317472 4

184 UK Other service activities 215169 5


In [146]:

# query we want to run
query_1 = """
SELECT *
FROM (
    SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, num_ojv,
           rank() over (partition by idcountry, macro_sector, idmacro_sector order by num_ojv desc) as rank
    FROM (
        SELECT idcountry, idmacro_sector, macro_sector, esco_level_4, count(distinct general_id) as num_ojv
        FROM cedefop_presentation.ft_document_essnet
        WHERE year_grab_date = 2019
        GROUP BY idmacro_sector, macro_sector, esco_level_4, idcountry
        ORDER BY num_ojv DESC
    ) t1
) t2
where rank <= 5
"""

# reading data using connection and query
top15_occ_by_count_sector = pd.read_sql(query_1, conn)

In [147]:

# reindexing the table
top15_occ_by_count_sector.set_index(["idcountry", "idmacro_sector", "macro_sector", 'esco_level_4'])

# renaming the columns
top15_occ_by_count_sector.columns = ['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']

# rearranging the columns
top15_occ_by_count_sector = top15_occ_by_count_sector[['idcountry', 'idmacro_sector', 'macro_sector', 'occupation', 'count', 'rank']]

# sort the table
top15_occ_by_count_sector = top15_occ_by_count_sector.sort_values(['count'], ascending=False)

In [148]:

top15_occ_by_count_sector.head(10)


Out[148]:

      idcountry  idmacro_sector  macro_sector  occupation  count  rank
460   DE  G  Wholesale and retail trade; repair of motor ve...  Shop sales assistants  95170  1
1274  UK  J  Information and communication  Software developers  78750  1
1393  UK  Q  Human health and social work activities  Nursing professionals  72297  1
365   DE  N  Administrative and support service activities  Administrative and executive secretaries  60448  1
385   DE  M  Professional, scientific and technical activit...  Systems analysts  55686  1
375   DE  J  Information and communication  Software developers  49373  1
380   DE  C  Manufacturing  Manufacturing labourers not elsewhere classified  47292  1
370   DE  Q  Human health and social work activities  Health care assistants  46932  1
386   DE  M  Professional, scientific and technical activit...  Engineering professionals not elsewhere classi...  46488  2
366   DE  N  Administrative and support service activities  Manufacturing labourers not elsewhere classified  45413  2


In [150]:

top15_occ_by_count_sector[(top15_occ_by_count_sector['idcountry'] == 'DE') & (top15_occ_by_count_sector['idmacro_sector'] == 'J')].head(5)

Data Visualization

Let's make a simple plot which shows the number of announcements per month. To do so, first, we should group our data by year and month and then count the records:

In [158]:

# query we want to run
query_1 = """
SELECT idcountry, year_grab_date, month_grab_date, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_document_essnet
GROUP BY idcountry, year_grab_date, month_grab_date
"""

# reading data using connection and query
date_groupped = pd.read_sql(query_1, conn)

Out[150]:

     idcountry  idmacro_sector  macro_sector  occupation  count  rank
375  DE  J  Information and communication  Software developers  49373  1
376  DE  J  Information and communication  Systems analysts  27968  2
377  DE  J  Information and communication  Systems administrators  10222  3
378  DE  J  Information and communication  Engineering professionals not elsewhere classi...  6989  4
379  DE  J  Information and communication  Advertising and marketing professionals  6616  5


In [161]:

date_groupped.reset_index()
date_groupped.head()

Now that we have our aggregated data, we can start plotting. The Python community offers a wide range of visualization packages, but here we stick with matplotlib, a classic choice!

In [160]:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

In [165]:

# getting number of records from the table
counts = date_groupped.groupby(['year_grab_date', 'month_grab_date']).sum()

counts

Out[161]:

idcountry year_grab_date month_grab_date num_ojv

0 IT 2018 7 230650

1 BE 2019 3 192837

2 BE 2018 12 180383

3 ES 2018 7 170325

4 BE 2018 8 192713

Out[165]:

num_ojv

year_grab_date month_grab_date

2018 7 5142680

8 5271987

9 6237170

10 6619957

11 8683441

12 6446523

2019 1 8013169

2 4566919

3 4561617


In [166]:

# creating a monthly date range and setting it as the index of our data
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

# setting the size of the plot
fig, ax = plt.subplots(figsize=(15, 7))

# plot the data (blue lines)
plt.plot(counts.index, counts)

# plot the data (black dots)
plt.scatter(counts.index, counts, c='k', zorder=10)

# setting the x ticks of the plot as the index of the data (dates)
plt.xticks(counts.index)

# setting X and Y axes labels
plt.xlabel('Date')
plt.ylabel('# Announcements')

# change the format of date ticks
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

# setting the title of the plot
plt.title('Total Monthly Number of Announcements', fontsize=20)

# drawing the plot
plt.show()

Now let's repeat what we have just done, using a subset of the data filtered for one country:


In [174]:

filter_country = 'DE'
data_by_country = date_groupped[date_groupped.idcountry == filter_country]
data_by_country

The other steps are identical to what we have done for the previous plot:

Out[174]:

idcountry year_grab_date month_grab_date num_ojv

27 DE 2019 1 2280502

29 DE 2018 9 1487587

40 DE 2018 7 1345705

42 DE 2018 12 1782813

55 DE 2019 2 1525130

72 DE 2019 3 1487849

83 DE 2018 11 2373502

95 DE 2018 8 1535226

102 DE 2018 10 1861479


In [175]:

# sort by year and month so the values line up with the monthly date index below
data_by_country = data_by_country.sort_values(['year_grab_date', 'month_grab_date'])
counts = data_by_country.num_ojv
counts.index = pd.date_range(start='2018-07-01', end='2019-03-31', freq='MS')

fig, ax = plt.subplots(figsize=(15, 7))
plt.plot(counts.index, counts)
plt.scatter(counts.index, counts, c='k', zorder=10)
plt.xticks(counts.index)
plt.xlabel('Date')
plt.ylabel('# Announcements')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.title(f'Monthly Number of Announcements - {filter_country}', fontsize=20)
plt.show()

Ok, let's try another type of visualization: Pie Chart

We want to add two filters, on city and occupation, and plot a pie chart of the number of announcements per contract type.

Note: 25 for ISCO/ESCO -> some ICT occupations

In [180]:

# query we want to run
query_1 = """
SELECT contract, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_document_essnet
WHERE city = 'Milano' and idesco_level_2 = '25'
GROUP BY contract
"""

# reading data using connection and query
filtered = pd.read_sql(query_1, conn)


In [181]:

# the query already returns one row per contract with its distinct-announcement count,
# so we index by contract and use num_ojv as the pie values
pie_data = filtered.set_index('contract')

# Notice that we're not directly using Matplotlib as we did for the previous plots.
# Pandas actually uses Matplotlib under the hood, so for simple plots like this one you can use
# the integrated visualizations of pandas without explicitly using Matplotlib functions.
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()

Repeating the previous plot, this time for working hours:


In [182]:

# query we want to run
query_1 = """
SELECT working_hours, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_document_essnet
WHERE city = 'Milano' and idesco_level_2 = '25'
GROUP BY working_hours
"""

# reading data using connection and query
filtered = pd.read_sql(query_1, conn)

pie_data = filtered.set_index('working_hours')
pie_data.plot.pie(y='num_ojv', figsize=(8, 8))
plt.show()

In [183]:

conn.close()

Case-Study : Source Country and Destination Country


Create a pivot table using sourcecountry and country as index and columns, with the percentage of country records for each sourcecountry
Remove non-significant values from the pivot (in this case, ones which are less than 5%)
Sort both rows and columns of the pivot table (descending)
Import skill data from the DataLab and perform the following actions on it:

head
info
count null values
get the most frequent skill by country

PIVOT TABLE

In [199]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT idcountry, sourcecountry, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_document_essnet
WHERE sourcecountry in ('IT', 'UK', 'IE', 'CZ', 'FR', 'DE', 'ES', 'AT', 'PL', 'BE', 'NL', 'SE', 'LU')
GROUP BY sourcecountry, idcountry
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()


In [200]:

documents.head()

In [201]:

# grouping and calculating percentages: for each idcountry, express each sourcecountry's
# num_ojv as a percentage of that country's total
coutry_groupped = documents.groupby(['idcountry', 'sourcecountry']).sum()\
    .groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))\
    .reset_index()

# making the pivot table, filling blank cells with zero and rounding the values to one decimal
pivot_data = coutry_groupped.pivot(index='idcountry', columns='sourcecountry', values='num_ojv').fillna(0).round(1)

Out[200]:

idcountry sourcecountry num_ojv

0 NL SE 1778

1 CZ BE 60

2 FR UK 9429

3 PL IE 48

4 UK SE 2832


In [202]:

pivot_data

DATA CLEANING

In [203]:

# We need to import numpy first
import numpy as np

Out[202]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK

idcountry

AT 87.5 0.0 0.0 11.9 0.0 0.2 0.0 0.0 0.0 0.1 0.2 0.0 0.0

BE 0.1 94.5 0.0 1.2 0.1 1.5 0.3 0.1 0.0 1.4 0.4 0.0 0.2

CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0

DE 2.2 0.0 0.0 97.3 0.0 0.2 0.0 0.0 0.0 0.0 0.2 0.1 0.0

ES 0.0 0.0 0.0 0.2 97.8 1.6 0.0 0.1 0.0 0.1 0.0 0.1 0.1

FR 0.0 0.4 0.0 0.3 0.1 98.8 0.1 0.1 0.0 0.1 0.0 0.0 0.1

IE 0.1 0.1 10.3 0.6 0.3 0.1 85.2 0.1 0.0 0.4 0.1 0.0 2.7

IT 0.2 0.2 0.0 0.3 0.2 0.3 0.0 98.5 0.0 0.1 0.1 0.1 0.0

LU 0.1 5.9 0.0 2.0 0.0 3.0 0.0 0.0 88.4 0.3 0.0 0.0 0.2

NL 0.1 0.4 0.0 1.4 0.1 0.0 0.0 0.0 0.0 97.4 0.3 0.1 0.2

PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.1 0.0

SE 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.1 0.0 0.0 0.0 99.5 0.1

UK 0.1 0.0 0.0 0.6 0.0 0.4 0.2 0.0 0.0 0.3 0.0 0.0 98.4


In [204]:

# With this line of code we replace the values less than 5% with 0

# .apply --> applies a function to the data table
# lambda x : do(x) --> a simple and fast way to write a function
# np.where(condition, something, something_else) --> similar to the =IF() function in Excel
pivot_data.apply(lambda x: np.where(x < 5, 0, x))
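The case study also asks to sort both rows and columns of the pivot table in descending order. That step isn't shown in the notebook; a minimal sketch, assuming we order each axis by its largest percentage:

# order rows and columns of the pivot by their largest share (descending)
row_order = pivot_data.max(axis=1).sort_values(ascending=False).index
col_order = pivot_data.max(axis=0).sort_values(ascending=False).index
pivot_data.loc[row_order, col_order]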

The skills

Importing skill data from Athena:

Out[204]:

sourcecountry AT BE CZ DE ES FR IE IT LU NL PL SE UK

idcountry

AT 87.5 0.0 0.0 11.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

BE 0.0 94.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

CZ 0.0 0.0 99.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

DE 0.0 0.0 0.0 97.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

ES 0.0 0.0 0.0 0.0 97.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

FR 0.0 0.0 0.0 0.0 0.0 98.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0

IE 0.0 0.0 10.3 0.0 0.0 0.0 85.2 0.0 0.0 0.0 0.0 0.0 0.0

IT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.5 0.0 0.0 0.0 0.0 0.0

LU 0.0 5.9 0.0 0.0 0.0 0.0 0.0 0.0 88.4 0.0 0.0 0.0 0.0

NL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 97.4 0.0 0.0 0.0

PL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.8 0.0 0.0

SE 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.5 0.0

UK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 98.4


In [208]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_2 = """
SELECT *
FROM "AwsDataCatalog".cedefop_presentation.ft_skill_analysis_essnet
ORDER BY RAND()
LIMIT 1000;
"""

# reading data using connection and query
skills = pd.read_sql(query_2, conn)

# closing connection
conn.close()

In [209]:

skills.head()

Out[209]:

general_id grab_date year_grab_date month_grab_date day_grab_date expire_date

0 163539137 17905 2019 1 9 18025

1 175883759 17935 2019 2 8 17995

2 166123483 17908 2019 1 12 18028

3 136630915 17852 2018 11 17 17972

4 82106708 17731 2018 7 19 17851

5 rows × 51 columns


In [210]:

skills.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 51 columns):
general_id             1000 non-null object
grab_date              1000 non-null int64
year_grab_date         1000 non-null int64
month_grab_date        1000 non-null int64
day_grab_date          1000 non-null int64
expire_date            1000 non-null int64
year_expire_date       1000 non-null int64
month_expire_date      1000 non-null int64
day_expire_date        1000 non-null int64
lang                   1000 non-null object
idesco_level_4         1000 non-null object
esco_level_4           1000 non-null object
idesco_level_3         1000 non-null object
esco_level_3           1000 non-null object
idesco_level_2         1000 non-null object
esco_level_2           1000 non-null object
idesco_level_1         1000 non-null object
esco_level_1           1000 non-null object
idescoskill_level_3    1000 non-null object
escoskill_level_3      1000 non-null object
idcity                 1000 non-null object
city                   1000 non-null object
idprovince             1000 non-null object
province               1000 non-null object
idregion               1000 non-null object
region                 1000 non-null object
idmacro_region         1000 non-null object
macro_region           1000 non-null object
idcountry              1000 non-null object
country                1000 non-null object
idcontract             1000 non-null object
contract               1000 non-null object
ideducational_level    1000 non-null object
educational_level      1000 non-null object
idsector               1000 non-null object
sector                 1000 non-null object
idmacro_sector         1000 non-null object
macro_sector           1000 non-null object
idcategory_sector      1000 non-null object
category_sector        1000 non-null object
idsalary               1000 non-null object
salary                 1000 non-null object
idworking_hours        1000 non-null object
working_hours          1000 non-null object
idexperience           1000 non-null object
experience             1000 non-null object
source_category        1000 non-null object
sourcecountry          1000 non-null object
source                 1000 non-null object
site                   1000 non-null object
companyname            1000 non-null object
dtypes: int64(8), object(43)
memory usage: 398.5+ KB


Null Values

To get the number of null cells for each column we should first use the .isnull( ) method, which returns True for each null cell and False for a non-null cell. Then we sum these values (True counts as 1, False as 0) to get the total number of null values for each column:

In [211]:

skills.isnull().sum()
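This random sample happens to contain no missing values, so every count below is zero. On a sample that does contain nulls, you could keep only the affected columns with a small sketch like this (not in the original notebook):

# show only the columns that actually contain null values
null_counts = skills.isnull().sum()
null_counts[null_counts > 0]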



Out[211]:

general_id             0
grab_date              0
year_grab_date         0
month_grab_date        0
day_grab_date          0
expire_date            0
year_expire_date       0
month_expire_date      0
day_expire_date        0
lang                   0
idesco_level_4         0
esco_level_4           0
idesco_level_3         0
esco_level_3           0
idesco_level_2         0
esco_level_2           0
idesco_level_1         0
esco_level_1           0
idescoskill_level_3    0
escoskill_level_3      0
idcity                 0
city                   0
idprovince             0
province               0
idregion               0
region                 0
idmacro_region         0
macro_region           0
idcountry              0
country                0
idcontract             0
contract               0
ideducational_level    0
educational_level      0
idsector               0
sector                 0
idmacro_sector         0
macro_sector           0
idcategory_sector      0
category_sector        0
idsalary               0
salary                 0
idworking_hours        0
working_hours          0
idexperience           0
experience             0
source_category        0
sourcecountry          0
source                 0
site                   0
companyname            0
dtype: int64


Finding top skills by country:

In [212]:

vals = []
countries = []
sk = []
for country in skills.country.unique():
    sag = skills[skills.country == country]['escoskill_level_3']
    vals.append(sag.value_counts().iloc[0])
    sk.append(sag.value_counts().index[0])
    countries.append(country)

In [213]:

res = pd.DataFrame([countries, sk, vals]).T
res.columns = ['country', 'skill', 'count']
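The same table can be built without the explicit loop; a more idiomatic groupby sketch (not part of the original notebook):

# most frequent ESCO level-3 skill per country, using groupby instead of a loop
res_alt = (skills.groupby('country')['escoskill_level_3']
                 .agg(lambda s: s.value_counts().index[0])
                 .rename('skill')
                 .reset_index())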

In [214]:

res

Case-Study : Digital Occupations

Out[214]:

country skill count

0 NEDERLAND proactivity 3

1 DEUTSCHLAND adapt to change 14

2 UNITED KINGDOM adapt to change 17

3 ESPAÑA ICT networking hardware 1

4 ÖSTERREICH adapt to change 4

5 ITALIA communication 3

6 BELGIQUE-BELGIË create solutions to problems 2

7 SVERIGE communicate with customers 3

8 FRANCE adapt to change 11

9 POLSKA manage time 2

10 ČESKÁ REPUBLIKA engineering processes 1

11 IRELAND communication 1


Using the provided list of Eurostat digital occupations, create a subset of the skills data which contains only these occupations
For each digital occupation calculate the mixture of skills in percentage terms
Focus on programming languages

In [215]:

prof_digital = ['1330', '2511', '2512', '2513', '2514', '2519', '2521','2522', '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522']

In [218]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT *
FROM (
    SELECT esco_level_4, escoskill_level_3, num_ojv,
           rank() over (partition by esco_level_4 order by num_ojv desc) as rank
    FROM (
        SELECT esco_level_4, escoskill_level_3, count(distinct general_id) as num_ojv
        FROM cedefop_presentation.ft_skill_analysis_essnet
        WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522',
                                 '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
        GROUP BY esco_level_4, escoskill_level_3
        ORDER BY num_ojv DESC
    ) t1
) t2
where rank <= 5
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()
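The case study also asks for the skill mixture in percentage terms, which the notebook doesn't compute. A minimal sketch, expressing each retrieved skill count as a share of its occupation's total (note it only covers the top-5 skills returned by the query above):

# hypothetical extra step: percentage share of each skill within its occupation
documents['share_pct'] = (100 * documents['num_ojv']
                          / documents.groupby('esco_level_4')['num_ojv'].transform('sum')).round(1)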


In [220]:

documents[documents['esco_level_4']=='Software developers'].head(5)

PROGRAMMING LANGUAGES BY LOCATION

In [50]:

langs = ['SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++']

In [221]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT idcountry, escoskill_level_3, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_skill_analysis_essnet
WHERE escoskill_level_3 IN ('SQL', 'Java', 'C#', 'Python', 'PHP', 'matlab', 'SAS language', 'C++')
GROUP BY idcountry, escoskill_level_3
ORDER BY num_ojv DESC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()

Out[220]:

esco_level_4 escoskill_level_3 num_ojv rank

40 Software developers adapt to change 1279015 1

41 Software developers project management 1137579 2

42 Software developers computer programming 1101046 3

43 Software developers English 999963 4

44 Software developers teamwork principles 935304 5


In [225]:

# sort the per-country skill counts in descending order
lang_g = documents.reset_index().sort_values(['num_ojv'], ascending=False)

In [226]:

lang_g[lang_g['idcountry']=='UK'].head(20)

Compare with DE...

In [227]:

lang_g[lang_g['idcountry']=='DE'].head(20)

Out[226]:

index idcountry escoskill_level_3 num_ojv

0 0 UK SQL 450804

3 3 UK Java 225294

4 4 UK Python 171930

8 8 UK C# 115636

9 9 UK PHP 112480

12 12 UK C++ 84014

42 42 UK SAS language 13227

56 56 UK matlab 7254

Out[227]:

index idcountry escoskill_level_3 num_ojv

1 1 DE SQL 241867

2 2 DE Java 233301

7 7 DE C++ 124366

10 10 DE PHP 106713

14 14 DE Python 82810

24 24 DE C# 39666

26 26 DE matlab 29322

51 51 DE SAS language 9235


... or pivoting the data

In [230]:

lang_g.pivot(index='escoskill_level_3', columns='idcountry', values='num_ojv').fillna(0)

Case-Study : Education vs Experience

Creating a bubble chart for Education and Experience, with the number of records as the size of the bubbles:

Out[230]:

idcountry AT BE CZ DE ES FR IE IT

escoskill_level_3

C# 4168.0 4794.0 1172.0 39666.0 17889.0 21808.0 14227.0 13341.0

C++ 11446.0 2774.0 2090.0 124366.0 15214.0 41115.0 2644.0 13002.0

Java 22522.0 13014.0 5013.0 233301.0 47641.0 124376.0 13060.0 40730.0

PHP 6735.0 3934.0 1405.0 106713.0 20267.0 58000.0 3333.0 14702.0

Python 5699.0 4589.0 1894.0 82810.0 17914.0 50950.0 7813.0 7910.0

SAS language 323.0 708.0 0.0 9235.0 683.0 103818.0 358.0 880.0

SQL 27988.0 23465.0 8221.0 241867.0 51685.0 126214.0 22828.0 52186.0

matlab 2095.0 250.0 0.0 29322.0 259.0 5387.0 0.0 1171.0


In [252]:

# creating the connection
conn = connect(
    access_key='AKIAJTT6VBHWL5AWOS6Q',
    secret_key='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',
    region_name='eu-central-1',
    schema_name='default',
    s3_staging_dir='s3://aws-athena-query-results-980872539443-eu-central-1/')

# query we want to run
query_1 = """
SELECT ideducational_level, educational_level, idexperience, experience, count(distinct general_id) as num_ojv
FROM cedefop_presentation.ft_document_essnet
WHERE idesco_level_4 IN ('1330', '2511', '2512', '2513', '2514', '2519', '2521', '2522',
                         '2523', '2529', '3511', '3512', '3513', '3514', '3521', '3522')
GROUP BY ideducational_level, educational_level, idexperience, experience
ORDER BY ideducational_level ASC, idexperience ASC
"""

# reading data using connection and query
documents = pd.read_sql(query_1, conn)

conn.close()

In [253]:

# The query already aggregated the data for the desired columns; here we just reset the index
edu_exp = documents.reset_index()

In [254]:

edu_exp

Out[254]:

index  ideducational_level  educational_level  idexperience  experience  num_ojv
0   0  33289
1   1  1 No experience  5324
2   2  2 Up to 1 year  13588
3   3  3 From 1 to 2 years  4210
4   4  4 From 2 to 4 years  5046
5   5  5 From 4 to 6 years  939
6   6  6 From 6 to 8 years  224
7   7  7 From 8 to 10 years  111
8   8  8 Over 10 years  7605
9   9  1 Primary education  8167
10  10  1 Primary education  1 No experience  1468
11  11  1 Primary education  2 Up to 1 year  6732
12  12  1 Primary education  3 From 1 to 2 years  1900
13  13  1 Primary education  4 From 2 to 4 years  2569
14  14  1 Primary education  5 From 4 to 6 years  393
15  15  1 Primary education  6 From 6 to 8 years  73
16  16  1 Primary education  7 From 8 to 10 years  26
17  17  1 Primary education  8 Over 10 years  2033
18  18  2 Lower secondary education  151209
19  19  2 Lower secondary education  1 No experience  3935
20  20  2 Lower secondary education  2 Up to 1 year  98522
21  21  2 Lower secondary education  3 From 1 to 2 years  28474
22  22  2 Lower secondary education  4 From 2 to 4 years  19069
23  23  2 Lower secondary education  5 From 4 to 6 years  4330
24  24  2 Lower secondary education  6 From 6 to 8 years  1525
25  25  2 Lower secondary education  7 From 8 to 10 years  684
26  26  2 Lower secondary education  8 Over 10 years  44730
27  27  3 Post-secondary non-tertiary education  254480
28  28  3 Post-secondary non-tertiary education  1 No experience  10386
29  29  3 Post-secondary non-tertiary education  2 Up to 1 year  257372
... ... ... ... ... ... ...
51  51  5 Short-cycle tertiary education  6 From 6 to 8 years  7097
52  52  5 Short-cycle tertiary education  7 From 8 to 10 years  2220
53  53  5 Short-cycle tertiary education  8 Over 10 years  154879
54  54  6 Bachelor or equivalent  333980
55  55  6 Bachelor or equivalent  1 No experience  20851
56  56  6 Bachelor or equivalent  2 Up to 1 year  257164
57  57  6 Bachelor or equivalent  3 From 1 to 2 years  52426
58  58  6 Bachelor or equivalent  4 From 2 to 4 years  84984
59  59  6 Bachelor or equivalent  5 From 4 to 6 years  16845
60  60  6 Bachelor or equivalent  6 From 6 to 8 years  7468
61  61  6 Bachelor or equivalent  7 From 8 to 10 years  2467
62  62  6 Bachelor or equivalent  8 Over 10 years  116087
63  63  7 Master or equivalent  172997
64  64  7 Master or equivalent  1 No experience  10289
65  65  7 Master or equivalent  2 Up to 1 year  129748
66  66  7 Master or equivalent  3 From 1 to 2 years  21435
67  67  7 Master or equivalent  4 From 2 to 4 years  43954
68  68  7 Master or equivalent  5 From 4 to 6 years  12525
69  69  7 Master or equivalent  6 From 6 to 8 years  2131
70  70  7 Master or equivalent  7 From 8 to 10 years  778
71  71  7 Master or equivalent  8 Over 10 years  35596
72  72  8 Doctoral or equivalent  20955
73  73  8 Doctoral or equivalent  1 No experience  740
74  74  8 Doctoral or equivalent  2 Up to 1 year  17649
75  75  8 Doctoral or equivalent  3 From 1 to 2 years  3008
76  76  8 Doctoral or equivalent  4 From 2 to 4 years  4720
77  77  8 Doctoral or equivalent  5 From 4 to 6 years  1391
78  78  8 Doctoral or equivalent  6 From 6 to 8 years  245
79  79  8 Doctoral or equivalent  7 From 8 to 10 years  58
80  80  8 Doctoral or equivalent  8 Over 10 years  4505

81 rows × 6 columns


There are some missing data which we should remove before starting with the plotting. In this case the missing data are indicated by empty strings (""):

In [255]:

# Replacing "" with np.nan, which in Python represents missing data
edu_exp.replace('', np.nan, inplace=True)

# removing rows with "any" missing value
edu_exp.dropna(inplace=True)

Now our table is ready for plotting:

In [256]:

edu_exp

Out[256]:

index  ideducational_level  educational_level  idexperience  experience  num_ojv
10  10  1 Primary education  1 No experience  1468
11  11  1 Primary education  2 Up to 1 year  6732
12  12  1 Primary education  3 From 1 to 2 years  1900
13  13  1 Primary education  4 From 2 to 4 years  2569
14  14  1 Primary education  5 From 4 to 6 years  393
15  15  1 Primary education  6 From 6 to 8 years  73
16  16  1 Primary education  7 From 8 to 10 years  26
17  17  1 Primary education  8 Over 10 years  2033
19  19  2 Lower secondary education  1 No experience  3935
20  20  2 Lower secondary education  2 Up to 1 year  98522
21  21  2 Lower secondary education  3 From 1 to 2 years  28474
22  22  2 Lower secondary education  4 From 2 to 4 years  19069
23  23  2 Lower secondary education  5 From 4 to 6 years  4330
24  24  2 Lower secondary education  6 From 6 to 8 years  1525
25  25  2 Lower secondary education  7 From 8 to 10 years  684
26  26  2 Lower secondary education  8 Over 10 years  44730
28  28  3 Post-secondary non-tertiary education  1 No experience  10386
29  29  3 Post-secondary non-tertiary education  2 Up to 1 year  257372
30  30  3 Post-secondary non-tertiary education  3 From 1 to 2 years  34109
31  31  3 Post-secondary non-tertiary education  4 From 2 to 4 years  53213
32  32  3 Post-secondary non-tertiary education  5 From 4 to 6 years  8818
33  33  3 Post-secondary non-tertiary education  6 From 6 to 8 years  2342
34  34  3 Post-secondary non-tertiary education  7 From 8 to 10 years  1146
35  35  3 Post-secondary non-tertiary education  8 Over 10 years  112518
37  37  4 Upper secondary education  1 No experience  21002
38  38  4 Upper secondary education  2 Up to 1 year  220378
39  39  4 Upper secondary education  3 From 1 to 2 years  37237
40  40  4 Upper secondary education  4 From 2 to 4 years  48012
41  41  4 Upper secondary education  5 From 4 to 6 years  8003
42  42  4 Upper secondary education  6 From 6 to 8 years  2696
... ... ... ... ... ... ...
48  48  5 Short-cycle tertiary education  3 From 1 to 2 years  118960
49  49  5 Short-cycle tertiary education  4 From 2 to 4 years  151542
50  50  5 Short-cycle tertiary education  5 From 4 to 6 years  31885
51  51  5 Short-cycle tertiary education  6 From 6 to 8 years  7097
52  52  5 Short-cycle tertiary education  7 From 8 to 10 years  2220
53  53  5 Short-cycle tertiary education  8 Over 10 years  154879
55  55  6 Bachelor or equivalent  1 No experience  20851
56  56  6 Bachelor or equivalent  2 Up to 1 year  257164
57  57  6 Bachelor or equivalent  3 From 1 to 2 years  52426
58  58  6 Bachelor or equivalent  4 From 2 to 4 years  84984
59  59  6 Bachelor or equivalent  5 From 4 to 6 years  16845
60  60  6 Bachelor or equivalent  6 From 6 to 8 years  7468
61  61  6 Bachelor or equivalent  7 From 8 to 10 years  2467
62  62  6 Bachelor or equivalent  8 Over 10 years  116087
64  64  7 Master or equivalent  1 No experience  10289
65  65  7 Master or equivalent  2 Up to 1 year  129748
66  66  7 Master or equivalent  3 From 1 to 2 years  21435
67  67  7 Master or equivalent  4 From 2 to 4 years  43954
68  68  7 Master or equivalent  5 From 4 to 6 years  12525
69  69  7 Master or equivalent  6 From 6 to 8 years  2131
70  70  7 Master or equivalent  7 From 8 to 10 years  778
71  71  7 Master or equivalent  8 Over 10 years  35596
73  73  8 Doctoral or equivalent  1 No experience  740
74  74  8 Doctoral or equivalent  2 Up to 1 year  17649
75  75  8 Doctoral or equivalent  3 From 1 to 2 years  3008
76  76  8 Doctoral or equivalent  4 From 2 to 4 years  4720
77  77  8 Doctoral or equivalent  5 From 4 to 6 years  1391
78  78  8 Doctoral or equivalent  6 From 6 to 8 years  245
79  79  8 Doctoral or equivalent  7 From 8 to 10 years  58
80  80  8 Doctoral or equivalent  8 Over 10 years  4505

64 rows × 6 columns


In [257]:

# initial variables
plt.rcParams['figure.figsize'] = (20, 8)
fig = plt.figure()

# plotting bubbles
plt.scatter(edu_exp.educational_level, edu_exp.experience, s=edu_exp.num_ojv/100.0, alpha=0.5)

# Rotating X ticks
fig.autofmt_xdate(rotation=90)

# Adding a title to the plot
plt.title('Education Vs. Experience - Digital Occupations', fontsize=20)

# plotting!
plt.show()