October 4th, 2016
Automatic data publication in CKAN using Kettle
(a success case in Generalitat Valenciana)
index
• Intro – a short story
• Pentaho Data Integration (Kettle)
   • What is Kettle?
   • Some features
   • Some screenshots
• Description of the solution
   • Architecture
   • Execution phases
   • Use of API
• Creating a new dataset: step by step
• Conclusions
Intro – a short story
Problems
• Lack of culture around data reusability
• Lack of resources to provide reusable data
What is Kettle?
Community Edition available. Free license.
Pentaho Business Analytics components:
• Reporting
• Analysis Services
• Dashboard
• Data Integration (PDI): Spoon, Pan, Kitchen, Carte
• Data Mining
• BI Server
PDI (also called Kettle) is the Pentaho component responsible for Extraction, Transformation and Loading (ETL) processes. ETL tools are most frequently used in data warehouse environments; however, Pentaho Data Integration (PDI) can also be used for other purposes:
• Migrating data between applications or databases
• Exporting data from databases to flat files
• Loading data massively into databases
• Data cleaning
• Integrating applications
Kettle - features
Some features
• Data flow control (bandwidth consumption)
• Parallel or sequential process execution
• Process scheduling
• Develop custom Java classes / Java libraries
• Scripting (SQL, JavaScript, Shell)
Supported datasources
• Oracle, MySQL, PostgreSQL, SQLite, Sybase
• IBM DB2, Hypersonic, Informix, MS SQL Server, dBase
• SAP, Vertica, Palo, Hadoop…
• JDBC, ODBC, OCI, JNDI
• Text files (XML, JSON, CSV, RSS), Excel, MS Access
• Network folders, FTP servers
• REST and SOAP services
• Catalog Data Sources
Kettle - screenshots
• Designing a transformation
• Just a bunch of commands… drag & drop… and configure!
• Community and documentation links
• Execution log view: what happened?
• Verifying a transformation: warnings and errors details
• Debugging a transformation: breakpoint configuration
architecture
[Architecture diagram: datasources (databases, files), input files, Business Intelligence system, output files, API; the numbers 1 to 4 mark the orchestrator steps below]
Orchestrator process:
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
phases
• The main process runs phases in sequential order. A phase never starts until the previous one has finished.
• Within each phase, the process is run for each dataset.
1. File recovering
2. Import to BI
3. Resources generation
4. Resources uploading
[Diagram: datasets D1, D2 and D3 shown against the four phases]
phases
How does Kettle know which datasets are in which phase?
A database is used to store the information regarding the whole process and the state of execution of each dataset.
What if something goes wrong?
If something goes wrong with a dataset, it remains “stopped” in that phase until the next iteration of the main process.
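The bookkeeping described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the Kettle jobs used in the actual solution: the table name `dataset_state`, its columns, the phase identifiers and the handler functions are all hypothetical, since the slides do not show the real schema.

```python
import sqlite3

# Phase identifiers are hypothetical names for the four phases on the slide.
PHASES = ["file_recovering", "import_to_bi", "resources_generation", "resources_uploading"]


def init_state(conn, datasets):
    """Create the (hypothetical) state table with every dataset waiting at phase 0."""
    conn.execute("CREATE TABLE dataset_state (name TEXT PRIMARY KEY, next_phase INTEGER)")
    conn.executemany("INSERT INTO dataset_state VALUES (?, 0)", [(d,) for d in datasets])


def run_iteration(conn, handlers):
    """Run the phases in order; within each phase, process every dataset waiting there."""
    for index, phase in enumerate(PHASES):
        cursor = conn.execute(
            "SELECT name FROM dataset_state WHERE next_phase = ? ORDER BY name", (index,))
        waiting = [row[0] for row in cursor]
        for name in waiting:
            try:
                handlers[phase](name)
            except Exception:
                # The dataset stays "stopped" in this phase until the next iteration.
                continue
            conn.execute(
                "UPDATE dataset_state SET next_phase = ? WHERE name = ?", (index + 1, name))


def state(conn):
    """Return {dataset name: index of the next phase to run}."""
    return dict(conn.execute("SELECT name, next_phase FROM dataset_state"))
```

Calling `run_iteration` again plays the role of the next iteration of the main process: a dataset whose handler raised an error is picked up from the phase where it stopped, while the others are left untouched.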
[Animated diagram, 1st and 2nd iteration: datasets D1, D2 and D3 advancing through phases 1. File recovering, 2. Import to BI, 3. Resources generation and 4. Resources uploading]
4th phase – Use of API
• 4th phase - example of API functions being used:
ckan.logic.action.create
• ckan.logic.action.create.package_create
• ckan.logic.action.create.resource_create
• ckan.logic.action.create.tag_create
ckan.logic.action.patch
• ckan.logic.action.patch.package_patch
• ckan.logic.action.patch.resource_patch
ckan.logic.action.update
• ckan.logic.action.update.term_translation_update
• ckan.logic.action.update.term_translation_update_many
ckan.logic.action.delete
• ckan.logic.action.delete.package_delete
• ckan.logic.action.delete.dataset_purge
• ckan.logic.action.delete.resource_delete
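The functions listed above are exposed over HTTP through CKAN's Action API. The solution calls them from Kettle; the Python sketch below only illustrates the general shape of such a call. The site URL, API key and dataset fields are placeholders, not values from the talk, and uploading a file with `resource_create` needs a multipart request, which is not shown here.

```python
import json
import urllib.request


def build_action_request(site_url, action, payload, api_key):
    """Build (but do not send) a POST request for a CKAN Action API call."""
    url = f"{site_url.rstrip('/')}/api/3/action/{action}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def call_action(request):
    """Send the request and return the "result" field of CKAN's JSON response."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["result"]


# Placeholder values for illustration only.
example = build_action_request(
    "https://ckan.example.org",
    "package_create",
    {"name": "demo-dataset", "title": "Demo dataset"},
    "YOUR-API-KEY",
)
```

Building the request separately from sending it makes it easy to inspect what would be posted before pointing the call at a real portal.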
Creating a dataset: step-by-step
1st Step: Create a new .properties file for your new dataset
• This file contains several properties related to the new dataset. For instance: metadata in two languages, the type of datasource (file or DB), how often to update the dataset (daily, weekly, monthly, yearly).
• This file is read by the orchestrator process and is used to set the metadata when creating the dataset.
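As an illustration of what such a file might hold and how it could be read, here is a small Python sketch. The property keys and values in `SAMPLE` are invented for the example; the slides do not list the real ones. The parser handles only plain `key = value` lines and ignores the line continuations, `:` separators and escapes that Java-style .properties files also allow.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


# Hypothetical content: every key and value below is an assumption.
SAMPLE = """\
# Metadata in two languages
title.es = Citas registradas
title.en = Registered appointments
# Type of datasource: file or db
source.type = file
# Update frequency: daily, weekly, monthly, yearly
update.frequency = monthly
"""

properties = parse_properties(SAMPLE)
```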
2nd Step: Register the dataset in the database
• There is a script for launching these queries
• This step is required to call the new dataset from the orchestrator process
3rd Step: Create the folder structure for the new dataset
• config: contains the properties file
• input_files: contains source files
• input_error_files: contains processed files if any error occurred
• output_files: contains files pending upload to CKAN
• output_processed_files: contains a copy of the files uploaded to CKAN
Files are moved from one folder to another when each phase finishes.
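The folder layout above could be created with a few lines of Python, shown here as a sketch. The folder names come from the slide; the base directory, the dataset name and the `demo` helper are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Folder names as listed on the slide.
SUBFOLDERS = [
    "config",
    "input_files",
    "input_error_files",
    "output_files",
    "output_processed_files",
]


def create_dataset_folders(base_dir, dataset_name):
    """Create the per-dataset folders and return their paths keyed by folder name."""
    root = Path(base_dir) / dataset_name
    paths = {}
    for sub in SUBFOLDERS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        paths[sub] = path
    return paths


def move_to(source, target_dir):
    """Move a file into another folder, as happens when a phase finishes."""
    target = Path(target_dir) / Path(source).name
    return Path(shutil.move(str(source), str(target)))


def demo():
    """Create the folders in a temporary directory and move one file after its upload."""
    with tempfile.TemporaryDirectory() as tmp:
        folders = create_dataset_folders(tmp, "demo_dataset")
        pending = folders["output_files"] / "data.csv"
        pending.write_text("a;b\n")
        done = move_to(pending, folders["output_processed_files"])
        return sorted(folders), done.parent.name, pending.exists()
```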
4th Step: load process development
• This process reads data from the origin and loads it into the database (2nd phase)
• Depending on the data, this process can be copied (templates) among several datasets
• In any event, many steps of the process are reused (error handling, file/database access…)
5th Step: “select” query to generate resources.
• Each dataset has a different “SELECT”.
• This select is called by a common transformation process which generates CSV, XML and JSON files.
SELECT UPPER(OWCKAN.CSV_UTIL_PKG.ARRAY_TO_CSV(
         OWCKAN.T_STR_ARRAY(CR_ANYO, CR_MES, CR_DESC_MES, CR_DEPTO_ATENCION,
                            CR_SEXO, CR_COD_SEXO, CR_EDAD, CR_CITAS_REGISTRADAS),
         '';'')) AS CSV,
       ''1''
FROM OWCKAN.OD_SAN_IND_AT_PRIMARIA_CITAS
LEFT JOIN OWCKAN.OD_SAN_AP_MD_EDAD ON CR_EDAD = AP_COD_EDAD
LEFT JOIN OWCKAN.OD_SAN_MD_DEPTO ON CR_DEPTO_ATENCION = AH_DC_COD_DEPTO
WHERE CR_ANYO = ''ANYO_CHANGE''
  AND CR_MES = ''MES_CHANGE''
ORDER BY CR_DEPTO_ATENCION, CR_SEXO
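The query above appears to carry two markers, ANYO_CHANGE and MES_CHANGE, that get replaced with a concrete year and month before execution. The Python sketch below illustrates that substitution and one possible way to write the three output formats. It is not the Kettle transformation itself: the real query builds its CSV inside the database, the value format used for the month is an assumption, and the helper names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def fill_placeholders(query_template, year, month):
    """Replace the ANYO_CHANGE and MES_CHANGE markers with concrete values."""
    return query_template.replace("ANYO_CHANGE", str(year)).replace("MES_CHANGE", str(month))


def to_csv(columns, rows):
    """Serialize rows as semicolon-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows):
    """Serialize rows as a JSON array of objects keyed by column name."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


def to_xml(columns, rows):
    """Serialize rows as <rows><row><COLUMN>value</COLUMN>...</row></rows>."""
    root = ET.Element("rows")
    for row in rows:
        element = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(element, column).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Plain text replacement mirrors the markers in the slide; where the values come from outside the process, bound query parameters would be the safer choice.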
• Most of the steps can be accomplished immediately.
• We get fully automated datasets in less than one hour when using Kettle templates.
Estimated time per step:
• .properties file: 5 min
• Register dataset: 10 min
• Create folder structure: 5 min
• Process development: 2 h to 16 h
• “select” query: 10 min
Conclusions
• Lack of culture in some organizations about how to make reusability easier.
• We have the data and we have the platform for publishing; we have to simplify the transition from the source to the portal.
• The easier we make this transition, the more contributors will be willing to participate.
• Kettle is a good alternative: it's powerful, free, open source and already known by many IT departments.
• This platform is just one approach. You can design your own solution.
• Whenever you can, help people to produce the best reusable formats.