Talend Open Studio Data Integration
roberto-marchetto
www.robertomarchetto.com
Data Integration
Data Integration involves combining data residing in different sources and providing the user with a unified view of the data.
Data Management combines different disciplines to manage data as a valuable resource.
Talend
● Talend is a company focused on Data Integration and Data Management solutions
● Named a "Cool Vendor" by Gartner (2010)
● Present in more than 12 locations around the world
● Fast-growing company
Talend Open Studio
● Open Source, professional tool
● Draw procedures by linking components; each component performs an operation
● DB vendor-specific optimized components
● Produces fully editable Java (or Perl) code
● Deployment as small, fast compiled Java or as a Web Service
● Eclipse-based IDE, excellent flexibility
● BI platform independent, DB vendor independent
Automatic code generation, different deployments
Extraction, Transformation, Loading
● ETL is a common process in Data Integration
● Extract: read data from different data sources (databases, flat files, spreadsheet files, web services, etc.)
● Transform: convert data into a form that can be placed in another container (database, web services, files, etc.). Cleaning, computations and verifications are also performed
● Load: write the data in the target format
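
The three ETL steps can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the code Talend generates: the sample CSV data, the `sales` table, and the function names `extract`, `transform`, `load` and `run_etl` are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a flat file,
# a spreadsheet export, or a web-service response.
SOURCE_CSV = """id,name,amount
1, Alice ,10.50
2,Bob,
3,Carol,7.25
"""

def extract(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # verification step: skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """Chain the three steps and return what ended up in the target."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT id, name, amount FROM sales ORDER BY id"
    ).fetchall()
```

Calling `run_etl(SOURCE_CSV, sqlite3.connect(":memory:"))` loads the two complete records into an in-memory database and skips the row with the missing amount.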
Tutorial, Destination data (Data Warehouse)
Tutorial, Metadata
● Talend requires a preliminary definition of the metadata
● As in programming languages, a strong metadata definition often means fast, robust and maintainable applications
● ..demo..
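
To make the idea of a preliminary metadata definition concrete, here is a minimal sketch in Python of a schema declared up front and used to check rows. It is not how Talend stores metadata; the `Column` class, the `CUSTOMER_SCHEMA` example and the `validate` function are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    nullable: bool = False

# Hypothetical schema for a customer table.
CUSTOMER_SCHEMA = [
    Column("id", int),
    Column("name", str),
    Column("email", str, nullable=True),
]

def validate(row, schema):
    """Return the problems found in `row`; an empty list means it conforms."""
    problems = []
    for col in schema:
        if col.name not in row:
            problems.append(f"missing column: {col.name}")
            continue
        value = row[col.name]
        if value is None:
            if not col.nullable:
                problems.append(f"null not allowed: {col.name}")
        elif not isinstance(value, col.dtype):
            problems.append(
                f"wrong type for {col.name}: expected {col.dtype.__name__}"
            )
    # Columns that the schema does not know about are also reported.
    extra = set(row) - {col.name for col in schema}
    problems.extend(f"unexpected column: {name}" for name in sorted(extra))
    return problems
```

Because the schema is declared once, every row can be checked against it before it moves on, which is the same benefit that type declarations give in a programming language.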
Tutorial, Talend jobs basics
● Place components on the designer
● Link components to build a transformation
● Main type of link: row flow
● Schema metadata is propagated and must be coherent
● ..demo..
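
The notion of components linked by row flows, with a schema travelling along each link, can be sketched in plain Python. This is a conceptual illustration only, not Talend's API: the functions `source`, `filter_rows`, `add_column`, `sink` and `run_job` are hypothetical, and a flow is assumed to be a pair of a column-name list and a row iterator.

```python
def source(rows, schema):
    """Input component: emits rows and declares its output schema."""
    return schema, iter(rows)

def filter_rows(flow, predicate):
    """Processing component: schema passes through, rows are filtered."""
    schema, rows = flow
    return schema, (r for r in rows if predicate(r))

def add_column(flow, name, func):
    """Processing component: extends the schema with a computed column."""
    schema, rows = flow
    if name in schema:
        raise ValueError(f"column already in schema: {name}")
    return schema + [name], ({**r, name: func(r)} for r in rows)

def sink(flow, expected_schema):
    """Output component: refuses a flow whose schema is not the expected one."""
    schema, rows = flow
    if schema != expected_schema:
        raise ValueError(f"schema mismatch: {schema} != {expected_schema}")
    return list(rows)

def run_job(rows):
    """Link the components: source -> filter -> computed column -> sink."""
    flow = source(rows, ["item", "qty", "price"])
    flow = filter_rows(flow, lambda r: r["qty"] > 0)
    flow = add_column(flow, "total", lambda r: r["qty"] * r["price"])
    return sink(flow, ["item", "qty", "price", "total"])
```

Each component receives the schema of the component before it, so an incoherent link (for example a sink expecting a column that was never produced) fails before any row is written.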
Extensibility, community plugins
● Many official components
● Components for every task released by the community
● Geospatial components, log analysis, Google analytics, data encryption, etc
And now.. reports, dashboards, OLAP, Geoanalysis, KPIs..
What about data quality?
● Customer A is present 5 times with different names
● Null values can skew statistical measures such as the mean
● Duplicated records
● Blank values
● Some records can contain errors (e.g. -1 field values)
● Some records can be garbage
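
The quality problems listed above can be counted with a small profiling pass. The following Python sketch works under stated assumptions: the sample `RECORDS`, the `customer` and `age` field names, and the crude `normalize` rule for spotting the same customer under different spellings are all invented for the example.

```python
from statistics import mean

# Hypothetical records; None stands for a database NULL, -1 for an error value.
RECORDS = [
    {"customer": "ACME Inc.", "age": 40},
    {"customer": "acme inc", "age": 40},
    {"customer": "Beta Ltd", "age": None},
    {"customer": "", "age": -1},
    {"customer": "Gamma SpA", "age": 40},
]

def normalize(name):
    """Crude matching key: lowercase, keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def profile(records):
    """Count common quality problems and compare a naive and a clean mean."""
    names = [r["customer"] for r in records]
    ages = [r["age"] for r in records]
    keys = [normalize(n) for n in names if n.strip()]
    valid_ages = [a for a in ages if a is not None and a >= 0]
    return {
        "blank_names": sum(1 for n in names if not n.strip()),
        "null_ages": sum(1 for a in ages if a is None),
        "invalid_ages": sum(1 for a in ages if a is not None and a < 0),
        "possible_duplicates": len(keys) - len(set(keys)),
        # Naive mean: NULL treated as 0 and the -1 error value kept.
        "naive_mean_age": mean([0 if a is None else a for a in ages]),
        # Clean mean: only non-null, non-negative ages.
        "clean_mean_age": mean(valid_ages),
    }
```

On this sample the naive mean age is 23.8 while the mean over valid values is 40, which shows how a single null and a single -1 can distort a simple statistic.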
What about data storage size?
● Some fields can be oversized for the data they contain
● Sometimes fields are related and one can be calculated from the others
● Some keys or values are never used
● When data grows, garbage grows
● Data storage is not free (disks, electricity, backups, DB licenses)
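
Two of the points above, oversized fields and fields that can be calculated from others, lend themselves to a quick check. The Python sketch below is an assumption-laden illustration: the function names `oversized_columns` and `is_derivable`, the `slack` threshold, and the sample column names used in the usage note are hypothetical.

```python
def oversized_columns(rows, declared_widths, slack=2.0):
    """Report columns whose declared width is more than `slack` times
    the longest value actually stored, as {column: (declared, longest)}."""
    report = {}
    for column, declared in declared_widths.items():
        longest = max(
            (len(str(r[column])) for r in rows if r.get(column) is not None),
            default=0,
        )
        if declared > slack * longest:
            report[column] = (declared, longest)
    return report

def is_derivable(rows, target, func):
    """True if `target` equals func(row) in every row, i.e. the column
    could be computed on demand instead of stored. Uses exact equality,
    so floating-point columns may need a tolerance in practice."""
    return all(r[target] == func(r) for r in rows)
```

For example, with rows holding a two-letter `code` in a column declared 50 characters wide, `oversized_columns(rows, {"code": 50})` reports `{"code": (50, 2)}`, and `is_derivable(rows, "total", lambda r: r["qty"] * r["price"])` tells whether a stored `total` is redundant.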
Data is "the black gold" that can produce knowledge
● Data is a resource from which you can extract knowledge
● A lot of data produces concise information
● Data storage is not free, and a lot of data can slow systems down
● Data cleansing is a central process in statistical analysis and Data Mining