Best Practices Using Teradata With PowerCenter
This document discusses how to use Teradata with PowerCenter. It covers Teradata basics and describes some adjustments that may be necessary to support common practices.
Teradata Basics
Teradata is a relational database management system from NCR. It offers high performance for very large databases using a highly parallel
architecture. While Teradata can run on other platforms, it is predominantly found on NCR hardware (which runs NCR's version of UNIX). It is
very fast and very scalable.
Teradata Hardware
The NCR computers on which Teradata runs support both MPP (Massively Parallel Processing) and SMP (Symmetric Multi-Processing). Each MPP node (or semi-autonomous processing unit) can support SMP.
Teradata can be configured to communicate directly with a mainframe's I/O channel. This is known as channel attached. Alternatively, it can be network attached; that is, configured to communicate via TCP/IP over a LAN.
Channel attached is not always faster than network attached. Similar performance has been observed across a channel attachment as well as a 100 MB LAN. In addition, channel attached requires an additional sequential data move because the data must be moved from the PowerCenter server to the mainframe prior to moving the data across the mainframe channel to Teradata.
Teradata Software
There are Teradata Director Program Ids (TDPIDs), databases and users. The TDPID is the name that is used to connect from a Teradata client to a Teradata server (similar to an Oracle tnsnames.ora entry). Teradata databases and users are somewhat synonymous: a user has a userid, password and space to store tables, while a database is basically a user without a login and password (or, equivalently, a user is a database with a userid and password).
Teradata Access Module Processors (AMPs) are Teradata's parallel database engines. Although they are strictly software ('virtual processors' in NCR terminology), Teradata uses the terms AMP and hardware node interchangeably because, in the past, an AMP was a piece of hardware.
Client Configuration Basics for Teradata
The client side configuration is done using the hosts file (/etc/hosts on UNIX or winnt\system32\drivers\etc\hosts on Windows).
In the hosts file, the name of the Teradata instance (i.e. the TDPID, Teradata Director Program Id) is indicated by the letters and numbers that precede the string cop1 in a hosts file entry.
Example:
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly pcop1
This tells Teradata that when a client tool references the instance demo1099, it should direct requests to localhost (or IP address 127.0.0.1), and when a client tool references instance p, it should direct requests to the server curly (or IP address 192.168.80.113). This entry does not contain any kind of database server specific information (the TDPID is not the same as an Oracle instance ID). That is, the TDPID is used strictly to define the name a client uses to connect to a server. Teradata takes the name you specify, looks in the hosts file to map the cop1 (or cop2, etc.) entry to an IP address, and then attempts to establish a connection with Teradata at that IP address.
There can be multiple entries in a hosts file with similar TDPIDs:
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly_1 pcop1
192.168.80.114 curly_2 pcop2
192.168.80.115 curly_3 pcop3
192.168.80.116 curly_4 pcop4
This setup allows load balancing of clients among multiple Teradata nodes. Most Teradata systems have many nodes, and each node has its own IP address. Without the multiple hosts file entries, every client would connect to the same node, and eventually that node would carry a disproportionate share of the client processing. With multiple hosts file entries, if it takes too long for the node with the cop1 suffix (i.e. curly_1) to respond to the client request to connect to p, the client automatically attempts to connect to the node with the cop2 suffix (i.e. curly_2), and so forth.
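The cop-suffix convention above can be checked mechanically. The following is a minimal sketch that lists each TDPID and the node IP it maps to; the temporary file path and its entries are sample assumptions (a real check would read /etc/hosts):

```shell
# Build a sample hosts file (illustrative entries only).
cat > /tmp/sample_hosts <<'EOF'
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly_1 pcop1
192.168.80.114 curly_2 pcop2
EOF

# For each alias ending in copN, strip the suffix to recover the TDPID
# and print it with the node's IP address.
awk '{ for (i = 2; i <= NF; i++)
         if ($i ~ /cop[0-9]+$/) {
           tdpid = $i
           sub(/cop[0-9]+$/, "", tdpid)
           print tdpid " -> " $1
         } }' /tmp/sample_hosts
```

For the sample file this prints one line per cop entry, showing that TDPID p is spread across two nodes.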
PowerCenter and Teradata
PowerCenter accesses Teradata through several Teradata tools. This section defines each tool and describes how it is configured within PowerCenter.
ODBC
Teradata provides 32-bit ODBC drivers for Windows and UNIX platforms. If possible, use the ODBC driver from Teradata's TTU7 release (or above) of their client software because this version supports array reads. Tests have shown these drivers (3.02) can be 20% - 30% faster than the 3.01 drivers. The Teradata TTU 8.0 release uses ODBC 3.0421. Teradata's ODBC is on a performance par with Teradata's SQL CLI. In fact, ODBC is Teradata's recommended SQL interface for their partners.
16182 https://my-prod.informatica.com/infakb/whitepapers/1/Pages/16182.aspx
1 of 15 6/2/2009 2:07 PM
Do not use ODBC to write to Teradata unless you're writing very small data sets (and even then, you should probably use Tpump instead) because Teradata's ODBC is optimized for query access, not for writing data. So use Teradata ODBC for Teradata sources and lookups. PowerCenter Designer uses Teradata's ODBC to import all Teradata objects (sources, lookups, targets, etc.).
ODBC Windows
Configure the Teradata ODBC driver with the following information on Windows:
ODBC UNIX
When the PowerCenter server is running on UNIX, ODBC is required to read both sources and lookups from Teradata.
As with all UNIX ODBC drivers, the key to configuring the UNIX ODBC driver is adding the appropriate entries to the .odbc.ini file. To correctly configure the .odbc.ini file, there must be an entry under [ODBC Data Sources] that includes the Teradata ODBC driver shared library (tdata.sl on HP-UX, standard shared library extensions on other types of UNIX).
The following example shows the required entries from an actual .odbc.ini file (note the path to the driver may be different on each computer):
[ODBC Data Sources]
dBase=MERANT 3.60 dBase Driver
Oracle8=MERANT 3.60 Oracle 8 Driver
Text=MERANT 3.60 Text Driver
Sybase11=MERANT 3.60 Sybase 11 Driver
Informix=MERANT 3.60 Informix Driver
DB2=MERANT 3.60 DB2 Driver
MS_SQLServer7=MERANT SQLServer driver
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
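A quick way to sanity-check a DSN entry like the one above is to extract its Driver= path and confirm the shared library exists on the machine. This is a sketch against a sample file; the temporary path, DSN name and driver location are assumptions copied from the example:

```shell
# Re-create the example .odbc.ini in a sample location (illustrative only).
cat > /tmp/sample_odbc.ini <<'EOF'
[ODBC Data Sources]
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
EOF

# Pull the Driver= value out of the [TeraTest] section only.
driver=$(awk -F= '/^\[TeraTest\]/ { in_dsn = 1; next }
                  /^\[/           { in_dsn = 0 }
                  in_dsn && $1 == "Driver" { print $2 }' /tmp/sample_odbc.ini)
echo "Driver library: $driver"
[ -f "$driver" ] || echo "warning: $driver not found on this host"
```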
Similar to the client hosts file setup, you can specify multiple IP addresses for the DBCName to balance the client load across multiple Teradata nodes. Consult with a Teradata administrator for exact details on this (or copy the entries from the hosts file on the client machine; refer to the Client Configuration Basics section).
Important note:
Make sure that the DataDirect ODBC path precedes the Teradata ODBC path information in the PATH and SHLIB_PATH (or
LD_LIBRARY_PATH, etc.) environment variables. This is because both sets of ODBC software use some of the same file names.
PowerCenter must use the DataDirect files because this is the software that has been certified.
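A minimal sketch of the required ordering, assuming hypothetical install locations for the two ODBC packages (substitute your real paths):

```shell
# Illustrative install locations -- adjust for your environment.
DATADIRECT_HOME=/opt/odbc/datadirect
TERADATA_HOME=/usr/odbc

# Prepend DataDirect so its files shadow the identically named Teradata files.
PATH="$DATADIRECT_HOME/bin:$TERADATA_HOME/bin:$PATH"
SHLIB_PATH="$DATADIRECT_HOME/lib:$TERADATA_HOME/lib:${SHLIB_PATH:-}"   # LD_LIBRARY_PATH on Solaris/Linux
export PATH SHLIB_PATH

# Quick check that the DataDirect entry really comes first.
case "$PATH" in
  "$DATADIRECT_HOME"/bin:*) echo "DataDirect precedes Teradata: OK" ;;
  *)                        echo "WARNING: check PATH ordering" ;;
esac
```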
Teradata external loaders
PowerCenter supports four different Teradata external loaders:
Tpump
FastLoad
MultiLoad
Teradata Warehouse Builder (TWB)
The actual Teradata loader executables (tpump, mload, fastload and tbuild) must be accessible by the PowerCenter Server application.
All of the Teradata loader connections require a value for the TDPID attribute. Refer to the first section of this document to understand how to correctly enter the value.
All of these loaders require:
A load file (which can be configured to be a stream/pipe and is autogenerated by PowerCenter)
A control file of commands to tell the loader what to do (autogenerated by PowerCenter)
All of these loaders also produce a log file. This log file is the means to debug the loader if something goes wrong. As these are external loaders, all PowerCenter receives back from the loader is whether it ran successfully or not.
By default, the input file, control file and log file are created in the $PMTargetFileDir directory of the PowerCenter Server executing the workflow.
To use any of these loaders, configure the target in the PowerCenter session to be a File Writer and then choose the appropriate loader:
To override the auto-generated control file, click the Pencil icon next to the loader connection name:
Scroll to the bottom of the connection attribute list and click the value next to the Control File Content Override attribute. Then click the down arrow.
Click the Generate button and change the control file as desired. The changed control file is stored in the repository.
Most of the loaders also use some combination of internal Work, Error and Log tables. By default, these will be in the same database as the
target table. All of these can now be overridden in the attributes of the connection.
To write the input flat file that the loaders need to disk, the Is Staged attribute must be checked. If the Is Staged attribute is not set, the file is piped/streamed to the loader.
If you select the non-staged mode for a loader, you should also set the Checkpoint property to '0'. This effectively turns off checkpoint processing. Checkpoint processing is used for recovery/restart of FastLoad and MultiLoad sessions. However, if you are not using a physical file as input, but rather a named pipe, then the recovery/restart mechanism of the loaders does not work. Not only does this impact performance (checkpoint processing is not free, and unnecessary overhead should be eliminated wherever possible), but a non-zero checkpoint value will sometimes cause seemingly random errors and session failures when used with named pipe input (as is the case with streaming mode).
Teradata loader requirements for PowerCenter servers on UNIX
All Teradata load utilities require a non-null standard output and standard error to run properly. Standard output (STDOUT) and standard error (STDERR) are UNIX conventions that determine the default location for a program to write output and error information. When you start the PowerCenter Server without explicitly defining STDOUT and STDERR, both point to the current terminal session. If you log out of UNIX, UNIX redirects STDOUT and STDERR to /dev/null (i.e. a placeholder that discards anything written to it). At this point, Teradata loader sessions will fail because they do not permit STDOUT and STDERR to be /dev/null. Therefore, you must start the PowerCenter Server as follows (from the PowerCenter installation directory):
./pmserver ./pmserver.cfg > ./pmserver.out 2>&1
This starts the PowerCenter Server using the pmserver.cfg configuration file and points STDOUT and STDERR to the file pmserver.out. In this way, STDERR and STDOUT remain defined even after the terminal session logs out.
Important note:
There are no spaces in the token 2>&1. This tells UNIX to redirect STDERR to the same place as STDOUT.
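The effect of the redirection can be verified with any command that writes to both streams. A minimal sketch, using an illustrative output file name:

```shell
# "2>&1" duplicates STDERR onto whatever STDOUT currently points at, so it
# must come after the "> file" redirection. /tmp/sample.out is illustrative.
{ echo "written to stdout"; echo "written to stderr" 1>&2; } > /tmp/sample.out 2>&1
cat /tmp/sample.out
```

Both lines end up in the file, which is exactly what the pmserver start command relies on.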
As an alternative to this method, you can specify the console output file name in the pmserver.cfg file. That is, information written to standard output and standard error will go to the file specified as follows:
ConsoleOutputFilename=
With this entry in the pmserver.cfg file, you can start the PowerCenter Server normally (i.e. ./pmserver).
Partitioned Loading
With PowerCenter, if you set a round robin partition point on the target definition and set each target instance to be loaded using the same loader connection instance, PowerCenter automatically writes all data to the first partition and starts only one instance of FastLoad or MultiLoad.
You will know you are getting this behavior if you see the following entry in the session log:
MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.
If you do not see this message, then chances are the session fails with the following error:
WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.
Tpump
Tpump is an external loader that supports inserts, updates, upserts, deletes and data driven updates. Multiple Tpump instances can execute simultaneously against the same table because Tpump does not use many resources or require table-level locks. It is often used to trickle load a table. As stated earlier, it is a faster way to update a table than ODBC, but it is not as fast as the other loaders.
MultiLoad
This is a sophisticated bulk load utility and is the primary method PowerCenter uses to load/update mass quantities of data into Teradata.
Unlike bulk load utilities from other vendors, MultiLoad supports inserts, updates, upserts, deletes and data driven operations in PowerCenter.
You can also use variables and embed conditional logic into MultiLoad scripts. It is very fast (millions of rows in a few minutes). It can be
resource intensive and will take a table lock.
Cleaning up after a failed MultiLoad
MultiLoad puts the target table into the MultiLoad state. Upon successful completion, the target table is returned to the Normal (non-MultiLoad) state. Therefore, when a MultiLoad fails for any reason, the table is left in the MultiLoad state, and you cannot simply re-run the same MultiLoad; MultiLoad will report an error. In addition, MultiLoad also queries the target table's MultiLoad log table to see if it contains any errors. If a MultiLoad log table exists for the target table, you will also be unable to rerun your MultiLoad job.
To recover from a failed MultiLoad, you must release the target table from the MultiLoad state and also drop the MultiLoad log table. You can
do this using BTEQ or QueryMan to issue the following commands:
drop table mldlog_<table name>;
release mload <table name>;
Note:
The drop table command assumes that you are recovering from a MultiLoad script generated by PowerCenter (PowerCenter always names the MultiLoad log table mldlog_<table name>). If you're working with a hand-coded MultiLoad script, the name of the MultiLoad log table could be anything.
Here is the actual text from a BTEQ session which cleans up a failed load to the table td_test owned by the user infatest:
BTEQ -- Enter your DBC/SQL request or BTEQ command:
drop table infatest.mldlog_td_test;

drop table infatest.mldlog_td_test;
*** Table has been dropped.
*** Total elapsed time was 1 second.

BTEQ -- Enter your DBC/SQL request or BTEQ command:
release mload infatest.td_test;

release mload infatest.td_test;
*** Mload has been released.
*** Total elapsed time was 1 second.
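Because the two cleanup statements follow a fixed pattern, they can be generated from the table name and piped into BTEQ. A minimal sketch, assuming the PowerCenter mldlog_ naming convention; the table name is illustrative:

```shell
# Generate the MultiLoad cleanup statements for a db.table name.
mload_cleanup_sql() {
  db_table="$1"             # e.g. infatest.td_test
  db="${db_table%.*}"       # database portion
  tbl="${db_table##*.}"     # table portion
  echo "drop table ${db}.mldlog_${tbl};"
  echo "release mload ${db_table};"
}

mload_cleanup_sql infatest.td_test
# To run it for real (requires the Teradata client tools):
#   mload_cleanup_sql infatest.td_test | bteq
```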
Using one instance of MultiLoad to load multiple tables
MultiLoad can require a large amount of resources on a Teradata system. Some systems will have hard limits on the number of concurrent
MultiLoad sessions allowed. By default, PowerCenter will start an instance of MultiLoad for every target file. To use a single instance of
MultiLoad to load multiple tables (or to load both inserts and updates into the same target table) the generated MultiLoad script file must be
edited.
Note:
This should not be an issue with Tpump because Tpump is not as resource intensive as MultiLoad (and multiple concurrent instances of Tpump can target the same table).
Here's a workaround:
1. Use a dummy session (i.e. set test rows to 1 and target a test database) to generate MultiLoad control files for each of the targets.
2. Merge the multiple control files (one per target table) into a single control file (one for all target tables).
3. Configure the session to call MultiLoad from a post-session script using the control file created in step 2. Integrated support cannot be used because each input file is processed sequentially, and this causes problems when combined with PowerCenter's integrated named pipes and streaming.
Details on merging the control files:
1. There is a single log file for each instance of MultiLoad, so you do not have to change or add anything in the LOGFILE statement. However, you might want to change the name of the log table since it may be a log that spans multiple tables.
2. Copy the work and error table delete statements into the common control file.
3. Modify the BEGIN MLOAD statement to specify all the tables that the MultiLoad will be hitting.
4. Copy the Layout sections into the common control file and give each a unique name. Organize the file such that all the Layout sections are grouped together.
5. Copy the DML sections into the common control file and give each a unique name. Organize the file such that all the DML sections are grouped together.
6. Copy the Import statements into the common control file and modify them to reflect the unique names created for the referenced LAYOUT and DML sections in steps 4 and 5. Organize the file such that all the Import sections are grouped together.
7. Run chmod -w on the newly created control file so PowerCenter does not overwrite it, or name it something different so PowerCenter cannot overwrite it.
Note:
A single instance of MultiLoad can target at most 5 tables.
Therefore, do not combine more than 5 target files into a common file.
Example:
Here's an example of a control file merged from two default control files:
.DATEFORM ANSIDATE;
.LOGON demo1099/infatest,infatest;
.LOGTABLE infatest.mldlog_TD_TEST;
DROP TABLE infatest.UV_TD_TEST ;
DROP TABLE infatest.WT_TD_TEST ;
DROP TABLE infatest.ET_TD_TEST ;
DROP TABLE infatest.UV_TD_CUSTOMERS ;
DROP TABLE infatest.WT_TD_CUSTOMERS ;
DROP TABLE infatest.ET_TD_CUSTOMERS ;
.ROUTE MESSAGES WITH ECHO TO FILE c:\LOGS\TgtFiles\td_test.out.ldrlog ;
.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
ERRLIMIT 1
CHECKPOINT 10000
TENACITY 10000
SESSIONS 1
SLEEP 6;
/* Begin Layout Section */
.Layout InputFileLayout1;
.Field CUST_KEY 1 CHAR(12) NULLIF CUST_KEY = '*' ;
.Field CUST_NAME 13 CHAR(20) NULLIF CUST_NAME = '*' ;
.Field CUST_DATE 33 CHAR(10) NULLIF CUST_DATE = '*' ;
.Field CUST_DATEmm 33 CHAR(2);
.Field CUST_DATEdd 36 CHAR(2);
.Field CUST_DATEyyyy 39 CHAR(4);
.Field CUST_DATEtd CUST_DATEyyyy||'/'||CUST_DATEmm||'/'||CUST_DATEdd NULLIF CUST_DATE = '*' ;
.Filler EOL_PAD 43 CHAR(2) ;
.Layout InputFileLayout2;
.Field CUSTOMER_KEY 1 CHAR(12);
.Field CUSTOMER_ID 13 CHAR(12);
.Field COMPANY 25 CHAR(50) NULLIF COMPANY = '*' ;
.Field FIRST_NAME 75 CHAR(30) NULLIF FIRST_NAME = '*' ;
.Field LAST_NAME 105 CHAR(30) NULLIF LAST_NAME = '*' ;
.Field ADDRESS1 135 CHAR(72) NULLIF ADDRESS1 = '*' ;
.Field ADDRESS2 207 CHAR(72) NULLIF ADDRESS2 = '*' ;
.Field CITY 279 CHAR(30) NULLIF CITY = '*' ;
.Field STATE 309 CHAR(2) NULLIF STATE = '*' ;
.Field POSTAL_CODE 311 CHAR(10) NULLIF POSTAL_CODE = '*' ;
.Field PHONE 321 CHAR(30) NULLIF PHONE = '*' ;
.Field EMAIL 351 CHAR(30) NULLIF EMAIL = '*' ;
.Field REC_STATUS 381 CHAR(1) NULLIF REC_STATUS = '*' ;
.Filler EOL_PAD 382 CHAR(2) ;
/* End Layout Section */
/* begin DML Section */
.DML Label tagDML1;
INSERT INTO infatest.TD_TEST (CUST_KEY,CUST_NAME,CUST_DATE) VALUES (:CUST_KEY,:CUST_NAME,:CUST_DATEtd) ;
.DML Label tagDML2;
INSERT INTO infatest.TD_CUSTOMERS (CUSTOMER_KEY,CUSTOMER_ID,COMPANY,FIRST_NAME,LAST_NAME,ADDRESS1,ADDRESS2,CITY,STATE,POSTAL_CODE,PHONE,EMAIL,REC_STATUS) VALUES (:CUSTOMER_KEY,:CUSTOMER_ID,:COMPANY,:FIRST_NAME,:LAST_NAME,:ADDRESS1,:ADDRESS2,:CITY,:STATE,:POSTAL_CODE,:PHONE,:EMAIL,:REC_STATUS) ;
/* end DML Section */
/* Begin Import Section */
.Import Infile c:\LOGS\TgtFiles\td_test.out
Layout InputFileLayout1
Format Unformat
Apply tagDML1;

.Import Infile c:\LOGS\TgtFiles\td_customers.out
Layout InputFileLayout2
Format Unformat
Apply tagDML2;
/* End Import Section */
.END MLOAD;
.LOGOFF;
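As a sanity check against the 5-table limit, the TABLES clause of a merged control file can be counted mechanically before running it. A minimal sketch over an abbreviated sample file; the file name and contents are illustrative assumptions:

```shell
# Abbreviated merged control file (illustrative).
cat > /tmp/merged.ctl <<'EOF'
.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
ERRLIMIT 1
EOF

# Count the comma-separated table names on the BEGIN ... MLOAD TABLES line.
n=$(awk '/\.BEGIN IMPORT MLOAD TABLES/ {
           sub(/.*TABLES[ \t]*/, "")
           print split($0, t, ",")
         }' /tmp/merged.ctl)
echo "tables in control file: $n"
[ "$n" -le 5 ] && echo "within the 5-table limit"
```

Note this assumes the TABLES clause sits on a single line, as in the PowerCenter-generated scripts above.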
FastLoad
As the name suggests, FastLoad is the fastest method to load data into Teradata. However, there is one major restriction: the target table must be empty.
Teradata Warehouse Builder (TWB)
Teradata Warehouse Builder (TWB) is a single utility that was intended to replace FastLoad, MultiLoad, Tpump and FastExport. It was to support a single scripting environment with different modes, where each mode roughly equates to one of the legacy utilities. It was also to support parallel loading (i.e. multiple instances of a TWB client could run and load the same table at the same time, something the legacy loaders cannot do). Unfortunately, NCR/Teradata does not support TWB, and TWB has never been formally released. According to NCR, the release was delayed primarily because of issues with the mainframe version.
Defining primary keys for tables to support updates, upserts and deletes
As with any other database technology, primary keys have to be specified for Teradata tables when doing updates, upserts or deletes using Teradata loaders. Sometimes, however, there are no primary keys defined for the underlying Teradata tables. In this case, primary keys have to be defined in the metadata when target table definitions are imported using Warehouse Designer. The list of primary keys to be used in this definition can be obtained from the Teradata DBA.
If a table has a partition defined, then the key(s) on which the partition has been defined should also be marked as primary key(s) when defining the target table. This information can be obtained either from the DBA or by looking at the table definition scripts.
In the example below, the Customer_ID, Customer_Name and Effective_Date fields have been marked as primary keys even though these are not primary keys in the underlying Teradata table.
Note:
If the primary keys are not defined then any attempt to update, upsert or delete data using Teradata loaders will result in an error.
More Information
Reference
Teradata documentation http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm
Teradata Forum http://www.teradataforum.com
Related Documents
Teradata Frequently Asked Questions
FAQ: What versions of Teradata does PowerCenter support?
Applies To
Database: Teradata
Operating Systems:
Other Software:
Product: PowerCenter
Last Modified Date: 4/10/2009 11:46 PM ID: 16182