How to Use PowerCenter with Teradata to Load and Unload Data

© 2009 Informatica Corporation

Abstract

This article explains how to move data between PowerCenter and Teradata databases. It explains when to use Teradata relational connections, Teradata load and unload utilities, or pushdown optimization to move data. This article also lists issues you might encounter when loading data to or unloading data from Teradata and the workarounds for these issues.

Table of Contents

Overview
    Prerequisites
Teradata Relational Connections
    Creating a Teradata Relational Connection
Standalone Load and Unload Utilities
    Teradata FastLoad
    Teradata MultiLoad
    Teradata TPump
    Teradata FastExport
Teradata Parallel Transporter
Pushdown Optimization
    Achieving Full Pushdown without Affecting the Source System
    Achieving Full Pushdown with Parallel Lookups
    Achieving Pushdown with Sorted Aggregation
    Achieving Pushdown for an Aggregator Transformation
    Achieving Pushdown when a Transformation Contains a Variable Port
    Improving Pushdown Performance in Mappings with Multiple Targets
    Removing Temporary Views when a Pushdown Session Fails
Issues Affecting Loading to and Unloading from Teradata
    Making 32-bit Load and Unload Utilities Work with 64-bit PowerCenter
    Increasing Lookup Performance
    Performing Uncached Lookups with Date/Time Ports in the Lookup Condition
    Restarting a Failed MultiLoad Job Manually
    Configuring Sessions that Load to the Same Table
    Setting the Checkpoint when Loading to Named Pipes
    Loading from Partitioned Sessions
    Loading to Targets with Date/Time Columns
    Hiding Passwords
    Using Error Tables to Identify Problems during Loading


Overview

Teradata is a global technology leader in enterprise data warehousing, business analytics, and data warehousing services. Teradata provides a powerful suite of software that includes the Teradata Database, data access and management tools, and data mining applications. PowerCenter works with the Teradata Database and Teradata tools to provide a data integration solution that allows you to integrate data from virtually any business system into Teradata as well as leverage Teradata data for use in other business systems.

PowerCenter uses the following techniques when extracting data from and loading data to the Teradata database:

ETL (extract, transform, and load). This technique extracts data from the source systems, transforms the data within PowerCenter, and loads it to target tables. The PowerCenter Integration Service transforms all data. If you use the PowerCenter Partitioning option, the Integration Service also parallelizes the workload.

ELT (extract, load, and then transform). This technique extracts data from the source systems, loads it to user-defined staging tables in the target database, and transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables. The database system transforms all data and parallelizes the workload, if necessary.

ETL-T (ETL and ELT hybrid). This technique extracts data from the source systems, transforms the data within PowerCenter, loads the data to user-defined staging tables in the target database, and further transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables. The ETL-T technique is optimized within PowerCenter so that the transformations that perform better within the database system can be performed there, while the Integration Service performs the other transformations.

To perform ETL operations, configure PowerCenter sessions to use a Teradata relational connection, a Teradata standalone load or unload utility, or Teradata Parallel Transporter. To use ELT or ETL-T techniques, configure PowerCenter sessions to use pushdown optimization.

Use a Teradata relational connection to communicate with Teradata when PowerCenter sessions load or extract small amounts of data (<1 GB per session). Teradata relational connections use ODBC to connect to Teradata. ODBC is a native interface for Teradata. Teradata provides 32- and 64-bit ODBC drivers for Windows and UNIX platforms. The driver bit mode must be compatible with the bit mode of the platform on which the PowerCenter Integration Service runs. For example, 32-bit PowerCenter only runs with 32-bit drivers.

Use a standalone load or unload utility when PowerCenter sessions extract or load large amounts of data (>1 GB per session). Standalone load and unload utilities can increase session performance by loading or extracting data directly from a file or pipe rather than running the SQL commands to load or extract the same data. All Teradata standalone load and unload utilities are fully parallel to provide optimal and scalable performance for loading data to or extracting data from the Teradata Database. PowerCenter works with the Teradata FastLoad, MultiLoad, and TPump load utilities and the Teradata FastExport unload utility.

Use Teradata Parallel Transporter for PowerCenter sessions that must quickly load or extract large amounts of data (>1 GB per session). Teradata Parallel Transporter provides all of the capabilities of the standalone load and unload utilities, plus it provides more granular control over the load or unload process, enhanced monitoring capabilities, and the ability to automatically drop log, error, and work tables when a session starts. Teradata Parallel Transporter is a parallel, multi-function extract and load environment that provides access to PowerCenter using an open API. It can load dozens of files using a single control file. It also allows you to distribute the workload among several CPUs, eliminating bottlenecks in the data loading and extraction processes.

Use pushdown optimization to reduce the amount of data passed between Teradata and PowerCenter or when the Teradata database can process transformation logic faster than PowerCenter. Pushdown optimization improves session performance by "pushing" as much transformation logic as possible to the Teradata source or target database. PowerCenter processes any transformation logic that cannot be pushed to the database. For example, pushing Filter transformation logic to the source database can reduce the amount of data passed to PowerCenter, which decreases session run time. When you run a session configured for pushdown optimization, PowerCenter translates the transformation logic into SQL queries and sends the queries to the Teradata database. The Teradata database executes the SQL queries to process the transformation logic.

Prerequisites

Before you run sessions that move data between PowerCenter and Teradata, you might want to install Teradata client tools. You also need to locate the Teradata TDPID.

Teradata Client Tools

Teradata client tools help you communicate with the Teradata database and debug problems that occur when a session loads data to or extracts data from the Teradata database.

You can install the following Teradata client tools:

BTEQ. A general-purpose, command-line utility (similar to Oracle SQL*Plus) that enables you to communicate with one or more Teradata databases.

Teradata SQL Assistant. A GUI-based tool that allows you to retrieve data from any ODBC-compliant database server and manipulate and store the data in desktop applications. Teradata Queryman is the older version of this tool.

Install BTEQ or Teradata SQL Assistant to help you debug problems that occur when loading to and extracting from Teradata. Both tools are included in the Teradata Utility Pack, which is available from Teradata.

TDPID

The Teradata TDPID indicates the name of the Teradata instance and defines the name a client uses to connect to a server. When you use Teradata Parallel Transporter or a standalone load or unload utility with PowerCenter, you must specify the TDPID in the connection properties.

The Teradata TDPID appears in the hosts file on the machines on which the Integration Service and PowerCenter Client run. By default, the hosts file appears in the following location:

UNIX: /etc/hosts

Windows: %SystemRoot%\system32\drivers\etc\hosts*

* The actual location is defined in the Registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DataBasePath

The hosts file contains client configuration information for Teradata. In a hosts file entry, the TDPID precedes the string “cop1.”

For example, the hosts file contains the following entries:

127.0.0.1      localhost demo1099cop1
192.168.80.113 td_1 custcop1
192.168.80.114 td_2 custcop2
192.168.80.115 td_3 custcop3
192.168.80.116 td_4 custcop4

The first entry has the TDPID "demo1099." This entry tells Teradata client tools that when they reference the Teradata instance "demo1099," they should direct requests to "localhost" (IP address 127.0.0.1).

The remaining entries share the TDPID "cust." Multiple hosts file entries with the same TDPID indicate that the Teradata instance is configured for load balancing among nodes. When a client tool references the Teradata instance "cust," requests are directed to the first node in the entry list, "td_1." If that node takes too long to respond, the request is redirected to the second node, and so on. This process prevents the first node, "td_1," from becoming overloaded.
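Before configuring PowerCenter connections, you can confirm that a TDPID resolves by logging on with BTEQ. A minimal sketch, reusing the "demo1099" TDPID and the "infatest" account that appear in examples elsewhere in this article; substitute your own TDPID and credentials:

.LOGON demo1099/infatest,infatest;
SELECT SESSION;
.LOGOFF;

If the logon succeeds, the hosts file entries for the TDPID are correct.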


Teradata Relational Connections

Teradata relational connections use ODBC to connect to Teradata. PowerCenter uses the ODBC Driver for Teradata to retrieve metadata and read and write to Teradata. To establish ODBC connectivity between Teradata and PowerCenter, install the ODBC Driver for Teradata on each PowerCenter machine that communicates with Teradata. The ODBC Driver for Teradata is included in the Teradata Tools and Utilities (TTU). You can download the driver from the Teradata web site.

Use a Teradata relational connection when extracting or loading small data sets, usually <1 GB per session. In sessions that extract or load large amounts of data, a standalone load or unload utility or Teradata Parallel Transporter is usually faster than a Teradata relational connection.

PowerCenter works with the ODBC Driver for Teradata available in the following TTU versions:

PowerCenter Versions    TTU Versions
7.0 - 8.1.1             8.1
8.5 and later           8.2, 12.0

For more information about the TTU versions that work with PowerCenter, see the TTU Supported Platforms and Product Versions document, which is available from Teradata @Your Service.

Sessions that perform lookups on Teradata tables must use a Teradata relational connection. If a session performs a lookup on a large, static Teradata table, you might be able to increase performance by using FastExport to extract the data to a flat file and configuring the session to look up data in the flat file.

If you experience performance problems when using a Teradata relational connection, and you do not want to use a load or unload utility, you might be able to configure PowerCenter sessions to use pushdown optimization.

If you load or extract data using a Teradata relational connection on UNIX, you must verify the configuration of environment variables and the odbc.ini file on the machine on which the Integration Service runs. To verify the environment variable configuration, ensure the Teradata ODBC path precedes the DataDirect driver path information in the PATH and shared library path environment variables. Place the Teradata path before the DataDirect path because both sets of ODBC software use some of the same file names.
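For example, on Linux you might prepend the Teradata ODBC directories to these variables in the profile of the user that runs the Integration Service. This is a sketch only; the /usr/odbc location matches the driver path in the odbc.ini excerpt below, but your installation directories may differ. On AIX, set LIBPATH instead of LD_LIBRARY_PATH:

# Put the Teradata ODBC directories ahead of the DataDirect directories.
PATH=/usr/odbc/bin:$PATH; export PATH
LD_LIBRARY_PATH=/usr/odbc/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH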

To verify the odbc.ini file configuration, make sure there is an entry for the Teradata ODBC driver in the "[ODBC Data Sources]" section of odbc.ini. The following excerpt from an odbc.ini file shows a Teradata ODBC driver (tdata.so) entry on Linux:

[ODBC Data Sources]
intdv12=tdata.so

[intdv12]
Driver=/usr/odbc/drivers/tdata.so
Description=NCR 3600 running Teradata V12
DBCName=intdv12
SessionMode=Teradata
CharacterSet=UTF8
StCheckLevel=0
DateTimeFormat=AAA
LastUser=
Username=
Password=
Database=
DefaultDatabase=

For more information about configuring odbc.ini, see the PowerCenter Configuration Guide and the ODBC Driver for Teradata User Guide.


Creating a Teradata Relational Connection

When you create a Teradata (relational) connection object in the Workflow Manager, choose "Teradata," and not "ODBC," as the connection type in the connection properties. When you choose Teradata as the connection type, the Integration Service still uses Teradata ODBC to connect to Teradata.

Although both ODBC and Teradata connection types might work, the Integration Service communicates with the Teradata database more efficiently when you choose the Teradata connection type. This is especially true if you use pushdown optimization in a session. If you use pushdown optimization in a Teradata session with an ODBC connection type, the Integration Service generates database connection driver warning messages.

For more information about creating connection objects in the Workflow Manager, see the PowerCenter Workflow Basics Guide.

Standalone Load and Unload Utilities

Teradata standalone load and unload utilities are fast, reliable tools that help you export large amounts of data from Teradata databases and load session target files into Teradata databases. Use a standalone load or unload utility when PowerCenter sessions extract or load large amounts of data. Standalone load and unload utilities are faster than Teradata relational connections because they load or extract data directly from a file or pipe rather than run SQL commands to load or extract the data.

PowerCenter works with the following Teradata standalone load and unload utilities:

FastLoad. Inserts large volumes of data into empty tables in a Teradata database.

MultiLoad. Updates, inserts, upserts, and deletes large volumes of data into empty or populated Teradata tables.

TPump. Inserts, updates, upserts, and deletes data in Teradata tables in near-real-time.

FastExport. Exports large data sets from Teradata tables or views to PowerCenter.

All of these load and unload utilities are included in the Teradata Tools and Utilities (TTU), available from Teradata.

PowerCenter supports all of these standalone load and unload utilities. Support for MultiLoad and TPump has been available since PowerCenter 6.0. Support for FastLoad was added in PowerCenter 7.0. Support for FastExport was added in PowerCenter 7.1.3.

Before you can configure a session to use a load or unload utility, create a loader or FastExport (application) connection in the PowerCenter Workflow Manager and enter a value for the TDPID in the connection attributes. For more information about creating connection objects in PowerCenter, see the PowerCenter Workflow Basics Guide.

To use a load utility in a session, configure the associated mapping to load to a Teradata target, configure the session to write to a flat file instead of a relational database, and select the loader connection for the session. To use FastExport in a session, configure the mapping to extract from a Teradata source, configure the session to read from FastExport instead of a relational database, and select the FastExport connection for the session. For more information about configuring a session to use a load or unload utility, see the PowerCenter Advanced Workflow Guide.

When a session transfers data between Teradata and PowerCenter, the following files are created:

A staging file or pipe. PowerCenter creates a staging file or named pipe for data transfer based on how you configure the connection. Named pipes are generally faster than staging files because data is transferred as soon as it appears in the pipe. If you use a staging file, data is not transferred until all data appears in the file.

A control file. PowerCenter generates a control file that contains instructions for loading or extracting data. PowerCenter creates the control file based on the loader or FastExport attributes you configure for the connection and the session.

A log file. The load or unload utility creates a log file and writes error messages to it. The PowerCenter session log indicates whether the session ran successfully, but does not contain load or unload utility error messages. Use the log file to debug problems that occur during data loading or extraction.


By default, loader staging, control, and log files are created in the target file directory. The FastExport staging, control, and log files are created in the PowerCenter temporary files directory. For more information about these files, see the PowerCenter Advanced Workflow Guide.

Teradata FastLoad

Teradata FastLoad is a command-line utility that quickly loads large amounts of data to empty tables in a Teradata database. Use FastLoad for a high-volume initial load or for high-volume truncate and reload operations.

FastLoad is the fastest load utility, but it has the following limitations:

FastLoad uses multiple sessions to load data, but it can load data to only one table in a Teradata database per job.

It locks tables while loading data, preventing other users and other FastLoad instances from accessing the tables during data loading.

FastLoad only works with empty tables with no secondary indexes.

It can only insert data.
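For orientation, the following is a minimal hand-coded FastLoad script of the kind PowerCenter generates for you from the loader connection attributes. The table, columns, and file name are hypothetical; the TDPID, user, and error table names follow the examples used elsewhere in this article:

LOGON demo1099/infatest,infatest;
/* Read a comma-delimited staging file. */
SET RECORD VARTEXT ",";
DEFINE cust_id (VARCHAR(10)),
       cust_name (VARCHAR(30))
FILE = /tmp/td_test.dat;
/* Load the empty target table; errors go to the ET_ and UV_ tables. */
BEGIN LOADING infatest.td_test
   ERRORFILES infatest.ET_td_test, infatest.UV_td_test;
INSERT INTO infatest.td_test (cust_id, cust_name)
VALUES (:cust_id, :cust_name);
END LOADING;
LOGOFF;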

Teradata MultiLoad

Teradata MultiLoad is a command-driven utility for fast, high-volume maintenance on multiple tables and views of a Teradata database. Each MultiLoad instance can perform multiple data insert, update, and delete operations on up to five different tables or views. MultiLoad optimizes operations that rapidly acquire, process, and apply data to Teradata tables. Use MultiLoad for large volume, incremental data loads.

MultiLoad has the following advantages:

MultiLoad is very fast. It can process millions of rows in a few minutes.

MultiLoad supports inserts, updates, upserts, deletes, and data-driven operations in PowerCenter.

You can use variables and embed conditional logic into MultiLoad control files.

MultiLoad supports sophisticated error recovery. It allows load jobs to be restarted without having to redo all of the prior work.

MultiLoad has the following limitations:

MultiLoad is designed for the highest possible throughput, so it can be very resource intensive.

It locks tables while loading data, preventing other users and other MultiLoad instances from accessing the tables during data loading.

Because of its “phased” nature, there are potentially inconvenient windows of time when MultiLoad cannot be stopped without losing access to target tables.

Teradata TPump

Teradata TPump is a highly parallel utility that can continuously move data from data sources into Teradata tables without locking the affected table. TPump supports inserts, updates, deletes, and data-driven updates. TPump acquires row hash locks on a database table instead of table-level locks, so multiple TPump instances can load data simultaneously to the same table. TPump is often used to "trickle-load" a database table. Use TPump for low-volume, online data loads.

TPump has the following advantages:

TPump can refresh database tables in near real-time.

TPump continuously loads data into Teradata tables without locking the affected tables, so users can run queries when TPump is running.


TPump is less resource-intensive than MultiLoad because it does not write to temporary tables.

Users can control the rate at which statements are sent to the Teradata database, limiting resource consumption.

It supports parallel processing.

TPump can always be stopped and all of its locks dropped with no ill effect.

However, TPump is not as fast as the other standalone loaders for large-volume loads because it changes the same data block multiple times.

Teradata FastExport

Teradata FastExport is a command-driven utility that uses multiple sessions to quickly transfer large amounts of data from Teradata sources to PowerCenter. Use FastExport to quickly extract data from Teradata sources.

FastExport has the following advantages:

It is faster than Teradata relational connections when extracting large amounts of data.

FastExport can be run in streaming mode, which avoids the need to stage the data file.

You can encrypt the data transfer between FastExport and the Teradata server.

FastExport is available for sources and pipeline lookups.

When you create a FastExport connection, verify the settings of the following connection attributes:

Data encryption. Enable this attribute to encrypt the data transfer between FastExport and the Teradata server so that unauthorized users cannot access the data being transferred across the network.

Fractional seconds. This attribute specifies the precision of the decimal portion of timestamp data. To avoid session failure or possible data corruption, make sure this value matches the timestamp precision of the column in the Teradata database.
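For orientation, the following is a minimal hand-coded FastExport script equivalent to what PowerCenter generates from the FastExport connection attributes. The table, columns, and file name are hypothetical; the TDPID and user follow the examples used elsewhere in this article:

.LOGTABLE infatest.fexp_log;
.LOGON demo1099/infatest,infatest;
.BEGIN EXPORT SESSIONS 4;
.EXPORT OUTFILE /tmp/td_test.dat MODE RECORD FORMAT TEXT;
/* The ORDER BY supports the sorted lookup tip described later in this article. */
SELECT cust_id, cust_name
FROM infatest.td_test
ORDER BY cust_id;
.END EXPORT;
.LOGOFF;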

For more information about configuring FastExport connection attributes, see the PowerCenter Advanced Workflow Guide.

Teradata Parallel Transporter

Teradata Parallel Transporter (PT) is a client application that provides scalable, high-speed, parallel data extraction, loading, and updating. It uses and expands upon the functions and capabilities of the standalone Teradata load and unload utilities. Teradata PT supports a single scripting environment with different system operators for extracting and loading data. It also supports massively parallel extraction and loading, so if you partition a Teradata PT session, multiple Teradata PT instances can extract or load large amounts of data in the same database tables at the same time.

To provide the functionality of the standalone load and unload utilities, Teradata PT extracts or loads data using one of the following system operators:

Export. Exports large data sets from Teradata tables or views and imports the data to PowerCenter for processing using the FastExport protocol.

Load. Bulk loads large volumes of data into empty Teradata database tables using the FastLoad protocol.

Update. Batch updates, inserts, upserts, and deletes data in Teradata database tables using the MultiLoad protocol.

Stream. Continuously updates, inserts, upserts, and deletes data in near real-time using the TPump protocol.

Teradata PT has the following advantages:

Teradata PT is up to 20% faster than the standalone Teradata load and unload utilities, even though it uses the underlying protocols from the standalone utilities.


Teradata PT supports recovery for sessions that use the Stream operator when the source data is repeatable. This feature is especially useful when running real-time sessions and streaming the changes to Teradata.

Users can invoke Teradata PT through a set of open APIs that communicate with the database directly, eliminating the need for a staging file or pipe and a control file.

Teradata PT eliminates the need to invoke different load and unload utilities to extract and load data.

PowerCenter communicates with Teradata PT using PowerExchange for Teradata Parallel Transporter, which is available through the Informatica-Teradata Enterprise Data Warehousing Solution. PowerExchange for Teradata Parallel Transporter was released with PowerCenter 8.1.1.

PowerExchange for Teradata Parallel Transporter provides integration between PowerCenter and Teradata databases for data extraction and loading. PowerExchange for Teradata Parallel Transporter executes Teradata PT operators directly through API calls. This improves performance by eliminating the staging file or named pipe. It also improves security by eliminating the control file, so there is no need to overwrite or store passwords in the control file. PowerExchange for Teradata Parallel Transporter supports session and workflow recovery. It also captures Teradata PT error messages and displays them in the session log, so you do not need to check the utility log file when errors occur.

Before you can configure a session to use Teradata PT, you must create a Teradata PT (relational) connection in the Workflow Manager and enter a value for the TDPID in the connection attributes. To configure a session to extract data, configure the associated mapping to read from Teradata, change the reader type for the session to Teradata Parallel Transporter Reader, and select the Teradata PT connection. To configure a session to load data, configure the associated mapping to load to Teradata, change the writer type for the session to Teradata Parallel Transporter Writer, and select the Teradata PT connection. In sessions that load to Teradata, you can also configure an ODBC connection that is used to automatically create the recovery table in the target database and drop the log, error, and work tables if a session fails.

For more information about using PowerExchange for Teradata Parallel Transporter, see the PowerExchange for Teradata Parallel Transporter User Guide.

Pushdown Optimization

When you run sessions that move data between PowerCenter and Teradata databases, you might be able to improve session performance using pushdown optimization. Pushdown optimization allows you to "push" PowerCenter transformation logic to the Teradata source or target database. The PowerCenter Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the database. The Teradata database executes the SQL queries to process the mapping logic. The Integration Service processes any mapping logic it cannot push to the database.


The following figure illustrates how pushdown optimization works with a Teradata database system:

The following figure shows a mapping in which you can increase performance using pushdown optimization:

If you configure this mapping for pushdown optimization, the Integration Service generates an SQL query based on the Filter and Lookup transformation logic and pushes the query to the source database. This improves session performance because it reduces the number of rows sent to PowerCenter. The Integration Service processes the Java transformation logic since that cannot be pushed to the database, and then loads data to the target.
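As a sketch of the kind of SQL this produces, suppose the Filter keeps only rows for one state and the Lookup returns a region name. The Integration Service might generate a single source-side query along these lines (all table and column names are invented for illustration):

SELECT c.cust_id,
       c.cust_name,
       r.region_name
FROM customers c
LEFT OUTER JOIN regions r
    ON c.region_id = r.region_id
WHERE c.state = 'CA';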

Use pushdown optimization to improve the performance of sessions that use Teradata relational connections to connect to Teradata. In general, pushdown optimization can improve session performance in the following circumstances:

When it reduces the number of rows passed between Teradata and PowerCenter. For example, pushing a Filter transformation to the Teradata source can reduce the number of rows PowerCenter extracts from the source.

When the database server is more powerful than the PowerCenter server. For example, pushing a complex Expression transformation to the source or target improves performance when the database server can perform the expression faster than the server on which the PowerCenter Integration Service runs.

When the generated query can take advantage of prebuilt indexes. For example, pushing a Joiner transformation to the Teradata source improves performance when the database can join tables using indexes and statistics that PowerCenter cannot access.

[Figure: Pushdown optimization with Teradata. PowerCenter extracts from the Teradata source database and stages data in the warehouse, and "ELT" pushdown processing runs the transformation logic as generated SQL inside the Teradata target database, combining a visual, codeless development environment with MPP-based performance, job control and logging, full metadata, and automatic scalability.]

Pushdown optimization is available with the PowerCenter Pushdown Optimization Option and has been supported since PowerCenter 8.0. To configure a session to use pushdown optimization, choose a Pushdown Optimization type in the session properties. You can select one of the following pushdown optimization types:

None. The Integration Service does not push any transformation logic to the database.

Source-side. The Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the database. It pushes as much transformation logic as possible to the source database.

The Integration Service generates SQL in the following form:

SELECT … FROM source … WHERE (filter/join condition) … GROUP BY

Target-side. The Integration Service analyzes the mapping from the target back to the source or until it reaches an upstream transformation it cannot push to the database. It pushes as much transformation logic as possible to the target database.

The Integration Service generates SQL in the following form:

INSERT INTO target(…) VALUES (?+1, UPPER(?))

Full. The Integration Service attempts to push all transformation logic to the target database. If the Integration Service cannot push all transformation logic to the database, it performs both source-side and target-side pushdown optimization.

The Integration Service generates SQL in the following form:

INSERT INTO target(…) SELECT … FROM source …

$$PushdownConfig. Allows you to run the same session with different pushdown optimization configurations at different times.

The Integration Service can push the logic for the following transformations to Teradata:

Transformation          Pushdown Types
Aggregator              Source-side, Full
Expression*             Source-side, Target-side, Full
Filter                  Source-side, Full
Joiner                  Source-side, Full
Lookup, connected       Source-side, Full
Lookup, unconnected     Source-side, Target-side, Full
Router                  Source-side, Full
Sorter                  Source-side, Full
Source Qualifier        Source-side, Full
Target                  Target-side, Full
Union                   Source-side, Full
Update Strategy         Full

* PowerCenter expressions can be pushed down only if there is an equivalent database function. To work around this issue, you can enter an SQL override in the source qualifier.

When you use pushdown optimization with sessions that extract from or load to Teradata, you might need to modify mappings or sessions to take full advantage of the performance improvements possible with pushdown optimization. You might also encounter issues if a pushdown session fails.


For example, you might need to perform the following tasks:

Achieve full pushdown optimization without affecting the source. To achieve full pushdown optimization for a session in which the source and target reside in different database management systems, you can stage the source data in the Teradata target database. For more information, see "Achieving Full Pushdown without Affecting the Source System."

Achieve full pushdown optimization with parallel lookups. To achieve full pushdown optimization for a mapping that contains parallel lookups, redesign the mapping to serialize the lookups. For more information, see "Achieving Full Pushdown with Parallel Lookups."

Achieve pushdown optimization with sorted aggregation. To achieve pushdown optimization for a mapping that contains a Sorter transformation before an Aggregator transformation, redesign the mapping to remove the Sorter transformation. For more information, see "Achieving Pushdown with Sorted Aggregation."

Achieve pushdown optimization for an Aggregator transformation with pass-through ports. To achieve pushdown optimization for a mapping that contains an Aggregator transformation with pass-through ports, redesign the mapping to remove the pass-through ports from the Aggregator transformation. For more information, see "Achieving Pushdown for an Aggregator Transformation."

Achieve pushdown optimization when a transformation contains a variable port. To achieve pushdown optimization for a mapping that contains a transformation with a variable port, update the expression to eliminate the variable port. For more information, see "Achieving Pushdown when a Transformation Contains a Variable Port."

Improve pushdown performance in mappings with multiple targets. To increase performance when using full pushdown optimization for mappings with multiple targets, you can stage the target data in the Teradata database. For more information, see "Improving Pushdown Performance in Mappings with Multiple Targets."

Remove temporary views after a session that uses an SQL query fails. If you run a pushdown session that uses an SQL query, and the session fails, the Integration Service might not drop the views it creates in the source database. You can remove the views manually. For more information, see "Removing Temporary Views when a Pushdown Session Fails."

For more information about pushdown optimization, see the PowerCenter Advanced Workflow Guide and the PowerCenter Performance Tuning Guide.

Achieving Full Pushdown without Affecting the Source System

You can stage source data in the Teradata target database to achieve full pushdown optimization. Stage source data in the target when the mapping contains a source that does not reside in the same database management system as the Teradata target.

For example, the following mapping contains an OLTP source and a Teradata target:

Since the source and target tables reside in different database management systems, you cannot configure the session for full pushdown optimization as it is. You could configure the session for source-side pushdown optimization, which would push the Filter and Lookup transformation logic to the source. However, pushing transformation logic to a transactional source might reduce performance of the source database.

To avoid the performance problems caused by pushing transformation logic to the source, you can reconfigure the mapping to stage the source data in the target database.


To achieve full pushdown optimization, redesign the mapping as follows:

1. Create a simple, pass-through mapping to pass all source data to a staging table in the Teradata target database:

Configure the session to use Teradata PT or a standalone load utility to load the data to the staging table. Do not configure the session to use pushdown optimization.

2. Configure the original mapping to read from the staging table:

Configure the session to use full pushdown optimization. The Integration Service pushes all transformation logic to the Teradata database, increasing session performance.

Achieving Full Pushdown with Parallel Lookups

The PowerCenter Integration Service cannot push down mapping logic that contains parallel Lookup transformations. When multiple Lookup transformations appear in different branches of a pipeline and the branches merge downstream, the Integration Service processes all transformations after the pipeline branch.

For example, the Integration Service cannot fully push down the following mapping:

To achieve full pushdown optimization, redesign the mapping so that the lookups are serialized as follows:

When you serialize the Lookup transformations, the Integration Service generates an SQL query in which the lookups become part of a subquery. The Integration Service can then push the entire query to the source database.


Achieving Pushdown with Sorted Aggregation

The Integration Service cannot push an Aggregator transformation to Teradata if it is downstream from a Sorter transformation. The Integration Service processes the Aggregator transformation.

For example, the Integration Service cannot push down the Aggregator transformation in the following mapping:

To redesign this mapping to achieve full or source-side pushdown optimization, configure the Aggregator transformation so that it does not use sorted input, and remove the Sorter transformation. For example:

Achieving Pushdown for an Aggregator Transformation

The Integration Service cannot push an Aggregator transformation to Teradata if the Aggregator transformation contains pass-through ports. To achieve source-side or full pushdown optimization for a mapping that contains an Aggregator transformation with pass-through ports, redesign the mapping to remove the pass-through ports from the Aggregator transformation.

Achieving Pushdown when a Transformation Contains a Variable Port

The Integration Service cannot push down transformation logic when the transformation contains a variable port. To achieve pushdown optimization for a mapping that contains a transformation with a variable port, update the transformation expression to eliminate the variable port. For example, a transformation contains a variable and an output port with the following expressions:

Variable port expression: NET_AMOUNT = AMOUNT - FEE

Output port expression: DOLLAR_AMT = NET_AMOUNT * RATE

To achieve pushdown optimization for the mapping, remove the variable port and reconfigure the output port as follows:

Output port expression: DOLLAR_AMT = (AMOUNT - FEE) * RATE

Improving Pushdown Performance in Mappings with Multiple Targets

If you configure a mapping that contains complex transformation logic and multiple targets for full pushdown optimization, the Integration Service generates one "INSERT … SELECT …" SQL query for each target. This makes pushdown optimization inefficient because it can cause duplicate processing of complex transformation logic within the database. To improve session performance, redesign the original mapping to stage the target data in the Teradata database. Then create a second mapping that uses the staging table as the source.
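To see why this duplicates work, consider a hypothetical mapping in which one aggregation feeds two targets. With full pushdown, the Integration Service generates two statements, each repeating the same expensive join and aggregation (all names invented):

INSERT INTO tgt_sales_summary (region_name, total_amount)
SELECT r.region_name, SUM(o.amount)
FROM orders o
JOIN regions r ON o.region_id = r.region_id
GROUP BY r.region_name;

INSERT INTO tgt_sales_history (region_name, total_amount)
SELECT r.region_name, SUM(o.amount)
FROM orders o
JOIN regions r ON o.region_id = r.region_id
GROUP BY r.region_name;

Staging the SELECT result once and loading both targets from the staging table performs the join and aggregation only once.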


For example, the following mapping contains two Teradata sources and two Teradata targets, all in the same RDBMS:

To achieve full pushdown optimization, redesign the mapping as follows:

1. Configure the original mapping to write to a staging table in the Teradata target database:

Configure the session to use full pushdown optimization.

2. Create a second mapping to pass all target data from the staging table to the Teradata targets:

Configure the session to use full pushdown optimization.

Removing Temporary Views when a Pushdown Session Fails

In a mapping, the Source Qualifier transformation provides the SQL Query option to override the default query. You can enter an SQL statement supported by the source database. When you override the default SQL query for a session configured for pushdown optimization, the Integration Service creates a view to represent the SQL override. It then runs an SQL query against this view to push the transformation logic to the database.

To use an SQL override in a session configured for pushdown optimization, enable the Allow Temporary View for Pushdown option in the session properties. This option allows the Integration Service to create temporary view objects in the database when it pushes the session to the database. The Integration Service uses a prefix of PM_V for the view objects it creates. When the session completes, the Integration Service drops the view from the database. If the session does not complete successfully, the Integration Service might not drop the view.

To search for views generated by the Integration Service, run the following query against the Teradata source database:

SELECT TableName FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\'
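You can also have Teradata generate the cleanup statements from the data dictionary. A sketch; review the generated list before running it:

SELECT 'DROP VIEW ' || TRIM(TableName) || ';'
FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\';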

To avoid problems when you run a pushdown session that contains an SQL override, use the following guidelines:

Ensure that the SQL override syntax is compatible with the Teradata source database. PowerCenter does not validate the syntax, so test the query before you push it to the database.

Do not use an order by clause in the SQL override.

Use ANSI outer join syntax in the SQL override. If the Source Qualifier transformation contains Informatica outer join syntax in the SQL override, the Integration Service processes the Source Qualifier transformation logic.

If the Source Qualifier transformation is configured for a distinct sort and contains an SQL override, the Integration Service ignores the distinct sort configuration.

If the Source Qualifier contains multiple partitions, specify the SQL override for all partitions.

Do not use a Sequence Generator transformation in the mapping. Teradata does not have a sequence generator function or operator.

Issues Affecting Loading to and Unloading from Teradata

This section describes issues you might encounter when you move data between PowerCenter and Teradata.

Making 32-bit Load and Unload Utilities Work with 64-bit PowerCenter

Applies to: FastLoad, MultiLoad, TPump, FastExport

If you use 64-bit PowerCenter, you need to reset the library path to make PowerCenter work with the 32-bit Teradata load and unload utilities. You must reset the library path before you can run a session that invokes a load or unload utility.

To reset the library path, you need to replace the loader or FastExport executable with a shell script. The following procedure explains how to reset the library path for TPump on AIX. You can use the same method to reset the library path for the other utilities on Linux or other UNIX operating systems.

To reset the library path:

1. Create a shell script like the following called <executable>_infa, for example, "tpump_infa":

#!/bin/sh
# Point the utility at the 32-bit Teradata libraries.
LIBPATH=/usr/lib; export LIBPATH
COPLIB=/usr/lib; export COPLIB
COPERR=/usr/lib; export COPERR
PATH=$PATH:$INFA_HOME/server/infa_shared/TgtFiles
# Replace this shell with the real utility, passing all arguments through.
exec tpump "$@"
exit $?

2. In the loader connection in the Workflow Manager, set the External Loader Executable attribute (for a load utility) or the Executable Name attribute (for FastExport) to the name of the shell script. So for TPump, change the External Loader Executable attribute from "tpump" to "tpump_infa."

Increasing Lookup Performance

Applies to: Teradata relational connections, FastExport

Sessions that perform lookups on Teradata tables must use Teradata relational connections. If you experience performance problems when running a session that performs lookups against a Teradata database, you might be able to increase performance in the following ways:

Use FastExport to extract data to a flat file and perform the lookup on the flat file.

Enable or disable the Lookup Cache.


Using FastExport to Extract Lookup Data

If a session performs a lookup on a large, static Teradata table, you might be able to increase performance by using FastExport to extract the data to a flat file and configuring the session to look up data in the flat file.

To do this, redesign the mapping as follows:

1. Create a simple, pass-through mapping to pass the lookup data to a flat file. Configure the session to extract data to the flat file using FastExport.

2. Configure the original mapping to perform the lookup on the flat file.

Note: If you redesign the mapping using this procedure, you can further increase performance by specifying an ORDER BY clause on the FastExport SQL and enabling the Sorted Input property for the lookup file. This prevents PowerCenter from having to sort the file before populating the lookup cache.

Enabling or Disabling the Lookup Cache

In sessions that perform lookups on Teradata tables, you might also be able to increase performance by enabling or disabling the lookup cache. When you enable lookup caching, the Integration Service queries the lookup source once, caches the values, and looks up values in the cache during the session. The lookup uses ODBC to populate the cache. When you disable lookup caching, each time a row passes into the transformation, the Integration Service issues a select statement to the lookup source for lookup values.

Enabling the lookup cache has the following advantages:

The Integration Service can search the cache very quickly.

Caches can be kept completely in memory.

Using a lookup cache prevents the Integration Service from making many separate calls to the database server.

The result of the lookup query and processing is the same whether or not you cache the lookup table. However, using a lookup cache can increase session performance for relatively static data in smaller lookup tables. Generally, it is better to cache lookup tables that require less than 300 MB of cache.

For data that changes frequently or is stored in larger lookup tables, disabling caching can improve overall throughput. Do not cache the lookup tables in the following circumstances:

The lookup tables are so large that they cannot be stored on the local system.

There are not enough inodes or blocks to save the cache files.

You are not allowed to save cache files on the Informatica system.

The amount of time needed to build the cache exceeds the amount of time saved by caching.

To enable or disable the lookup cache, enable or disable the Lookup Caching Enabled option in the Lookup transformation properties. For more information about the lookup cache, see the PowerCenter Transformation Guide and the PowerCenter Performance Tuning Guide.

Performing Uncached Lookups with Date/Time Ports in the Lookup Condition

Applies to: Teradata relational connections

When the Integration Service performs an uncached lookup on a Teradata database, the session fails if any port in the lookup condition is a date/time port. The Integration Service writes the following Teradata error message to the session log:

[][ODBC Teradata Driver][Teradata RDBMS] Invalid operation on an ANSI Datetime or Interval value.


To work around this issue, perform either of the following actions:

Apply the Teradata ODBC patch 3.2.011 or later and remove “NoScan=Yes” from the odbc.ini file.

Configure the Lookup transformation to use a lookup cache or remove the Date/Time port from the lookup condition.

Restarting a Failed MultiLoad Job Manually

Applies to: MultiLoad

When loading data, MultiLoad puts the target table into the "MultiLoad" state and creates a log table for the target table. After successfully loading the data, it returns the target table to the normal (non-MultiLoad) state and deletes the log table. If a MultiLoad job fails for any reason, MultiLoad reports an error and leaves the target table in the MultiLoad state. Additionally, MultiLoad queries the log table to check for errors. If a target table is in the MultiLoad state or if a log table exists for the target table, you cannot restart the job.

To recover from a failed MultiLoad job, you must release the target table from the MultiLoad state and drop the MultiLoad log table. To do this, enter the following commands using BTEQ or Teradata SQL Assistant:

drop table ML_<table name>;
release mload <table name>;

Note that PowerCenter adds the “ML_” prefix to the MultiLoad log table name. If you use a hand-coded MultiLoad control file, the log table can have any name.
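If you are not sure whether a log table was left behind, you can query the data dictionary for tables that use the PowerCenter log table prefix. A sketch, assuming the default "ML_" prefix:

SELECT DatabaseName, TableName
FROM DBC.Tables
WHERE TableName LIKE 'ML\_%' ESCAPE '\';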

For example, to recover from a failed job that attempted to load data to table “td_test” owned by user “infatest,” enter the following commands using BTEQ:

BTEQ -- Enter your DBC/SQL request or BTEQ command:
drop table infatest.mldlog_td_test;

drop table infatest.mldlog_td_test;
*** Table has been dropped.
*** Total elapsed time was 1 second.

BTEQ -- Enter your DBC/SQL request or BTEQ command:
release mload infatest.td_test;

release mload infatest.td_test;
*** Mload has been released.
*** Total elapsed time was 1 second.

Configuring Sessions that Load to the Same Table

Applies to: MultiLoad

While Teradata MultiLoad loads data to a database table, it locks the table. Because of this, each MultiLoad instance must be configured to wait for the table to become available so that multiple instances do not try to access the same table simultaneously.

If you have multiple PowerCenter sessions that load to the same Teradata table using MultiLoad, set the Tenacity attribute for the session to a value that is greater than the expected run time of the session. The Tenacity attribute controls the amount of time a MultiLoad instance waits for the table to become available. Also configure each session to use unique log file names.

For more information about the Tenacity attribute, see the PowerCenter Advanced Workflow Guide.


Setting the Checkpoint when Loading to Named Pipes

Applies to: FastLoad, MultiLoad, TPump

If you configure a session to load to Teradata using a named pipe, set the checkpoint loader attribute to 0 to prevent the loader from performing checkpoint operations. Teradata loaders use checkpoint values to recover or restart a failed loader job. When a loader job that uses a staging file fails, you can restart it from the last checkpoint. When the loader uses a named pipe, checkpoints are not used.

Setting the checkpoint attribute to 0 increases loader performance, since the loader job does not have to keep track of checkpoints. It also prevents the “broken pipe” errors and session failures that can occur when a nonzero checkpoint is used with a named pipe.

Loading from Partitioned Sessions

Applies to: FastLoad, MultiLoad

When you configure multiple partitions in a session that uses staging files, the Integration Service creates a separate flat file for each partition. Since FastLoad and MultiLoad cannot load data from multiple files, use round-robin partitioning to route the data to a single file. When you do this, the Integration Service writes all data to the first partition and starts only one instance of FastLoad or MultiLoad. It writes the following message in the session log:

MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.

If you do not route the data to a single file, the session fails with the following error:

WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
WRITER_1_*_1> Thu Jun 16 11:58:21 2005
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.

For more information about loading from partitioned sessions, see the PowerCenter Advanced Workflow Guide.

Loading to Targets with Date/Time Columns

Applies to: FastLoad, MultiLoad, TPump, Teradata PT

The target date format determines the format in which dates can be loaded into the column. PowerCenter only supports a limited set of Teradata date formats. Therefore, you must check the target date format to avoid problems loading date/time data.

When you create a date/time column in a Teradata database table, you specify the display format for the date/time values. The format you choose determines the format in which date/time values are displayed by Teradata client tools as well as the format in which date/time values can be loaded into the column. For example, a column in a Teradata table has the date format "yyyy/mm/dd." If you run a PowerCenter session that loads a date with the format "mm/dd/yyyy" into the column, the session fails.

Before running a session that loads date/time values to Teradata, verify that the format of each date/time column in the mapping matches the format of the corresponding date/time column in the Teradata target. If the session loads values into multiple date/time columns, check the format of each date/time column in the target because different tables often use different date/time formats. You can use Teradata BTEQ or SQL Assistant to check the format of a date/time column in a Teradata database.
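You can also read the declared format directly from the data dictionary rather than inspecting the table interactively. A sketch, assuming the "infatest.td_test" table used in examples elsewhere in this article; 'DA' and 'TS' are the dictionary type codes for DATE and TIMESTAMP columns:

SELECT ColumnName, ColumnType, ColumnFormat
FROM DBC.Columns
WHERE DatabaseName = 'infatest'
AND TableName = 'td_test'
AND ColumnType IN ('DA', 'TS');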

If any column in the Teradata target uses the “yyyyddd” date format (4-digit year followed by the 3-digit day), you must either redefine the date format in the Teradata table or convert the date to a character string in PowerCenter. Redefining the date format in the Teradata table does not change the way Teradata stores the date internally.


To convert a Teradata "yyyyddd" date column to a character column in PowerCenter:

1. Edit the target table definition in PowerCenter and change the date column data type from "date" to "char(7)."

2. Create an Expression transformation with the following expression to convert the date into a string with the format "yyyyddd":

to_char(date_port, 'yyyy') || to_char(date_port, 'ddd')

Note: The expression to_char(date_port, 'yyyyddd') does not work.

3. Link the output port in the Expression transformation to the “char(7)” column in the target definition.

Hiding Passwords

Applies to: FastExport, FastLoad, MultiLoad, TPump, Teradata PT

When you create a loader or application (FastExport) connection object, you enter the database user name and password in the connection properties. The Integration Service writes the password in the control file in plain text and the Teradata loader does not encrypt the password. To prevent the password from appearing in the control file, enter “PMNullPasswd” as the password. When you do this, the Integration Service writes an empty string for the password in the control file.

If you do not want to use "PMNullPasswd," perform either of the following actions:

Lock the control file directory.

For load utilities, configure PowerCenter to write the control file to a different directory, and then secure that directory.

By default, the PowerCenter Integration Service writes the loader control file to the target file directory and the FastExport control file to the temp file directory. To write the loader control file to a different directory, set the LoaderControlFileDirectory custom property to the new directory for the Integration Service or session. For more information about setting custom properties for the Integration Service, see the PowerCenter Administrator Guide. For more information about setting custom properties for the session, see the PowerCenter Workflow Basics Guide.

Finally, MultiLoad and TPump support the RUN FILE command. This command directs control from the current control file to the control file specified in the login script. Place the login statements in a file in a secure location, and then add the RUN FILE command to the generated control file to call it. Run “chmod -w” on the control file to prevent PowerCenter from overwriting it.

For example, create a login script as follows (in the file "login.ctl" in a secure directory path):

.LOGON demo1099/infatest,infatest;

Modify the generated control file and replace the login statement with the following command:

.RUN FILE <secure_directory_path>/login.ctl;

Using Error Tables to Identify Problems during Loading

Applies to: FastLoad, MultiLoad, TPump

When problems occur while loading data, the Teradata standalone load utilities generate error tables. (FastExport generates an error log file.) The load utilities generate different errors during the different phases of loading data.

FastLoad jobs run in two main phases: loading and end loading. During the loading phase, FastLoad initiates the job, locks the target table, and loads the data. During the end loading phase, the Teradata database distributes the rows of data to the target table and unlocks it. FastLoad requires an exclusive lock on the target table during the loading phase.

MultiLoad also loads data during two main phases: acquisition and application. In the acquisition phase, MultiLoad reads the input data and writes it to a temporary work table. In the application phase, MultiLoad writes the data from the work table to the actual target table. MultiLoad requires an exclusive lock on the target table during the application phase.


TPump loads data in a single phase. It converts the SQL in the control file into a database macro and applies the macro to the input data. TPump uses standard SQL and standard table locking.

The following table lists the error tables you can check to troubleshoot load or unload utility errors:

FastLoad
Loading phase: ET_<target_table_name> captures constraint violations, conversion errors, and unavailable AMP conditions.
End loading phase: UV_<target_table_name> captures unique primary index violations.

MultiLoad
Acquisition phase: ET_<target_table_name> captures all acquisition phase errors, as well as application phase errors that occur when the Teradata database cannot build a valid primary index.
Application phase: UV_<target_table_name> captures uniqueness violations, field overflow on columns other than primary index fields, and constraint errors.

TPump (single phase)
ET_<target_table_name><partition_number> captures all TPump errors.

When a load fails, check the “ET_” error table first for specific information. The ErrorField or ErrorFieldName column indicates the column in the target table that could not be loaded. The ErrorCode field provides details that explain why the column failed. For MultiLoad and TPump, the most common ErrorCodes are:

2689: Trying to load a null value into a non-null field

2665: Invalid date format

In the MultiLoad “UV_” error table, you can also check the DBCErrorField column and DBCErrorCode field. The DBCErrorField column is not initialized in the case of primary key uniqueness violations. The DBCErrorCode that corresponds to a primary key uniqueness violation is 2794.
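As a starting point for triage, you can group the rows in the "ET_" table by error code and failing column. A sketch against the hypothetical "td_test" target used earlier; use ErrorField in place of ErrorFieldName where your utility names the column that way:

SELECT ErrorCode, ErrorFieldName, COUNT(*) AS error_rows
FROM infatest.ET_td_test
GROUP BY ErrorCode, ErrorFieldName
ORDER BY error_rows DESC;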

For more information about Teradata error codes, see the Teradata documentation.

Authors

Lori Troy
Senior Technical Writer, Informatica Corporation

Chai Pydimukkala
Senior Product Manager, Informatica Corporation

Acknowledgements

The authors would like to thank Guy Boo, Ashlee Brinan, Eugene Ding, Stan Dorcey, Anudeep Sharma, Lalitha Sundaramurthy, Raymond To, Rama Krishna Tumrukoti, Sonali Verma, and Rajeeva Lochan Yellanki at Informatica for their assistance with this article. Additionally, the authors would like to thank Edgar Bartolome, Steven Greenberg, John Hennessey, and Michael Klassen at Teradata and Stephen Knilans and Michael Taylor at LoganBritton for their technical assistance.