Best Practices Using Teradata With PowerCenter
This document discusses how to use Teradata with PowerCenter. It covers Teradata basics and describes some adjustments that may be necessary to support common practices.
Teradata Basics
Teradata is a relational database management system from NCR. It offers high performance for very large databases using a highly parallel
architecture. While Teradata can run on other platforms, it is predominantly found on NCR hardware (which runs NCR's version of UNIX). It is
very fast and very scalable.
Teradata Hardware
The NCR computers on which Teradata runs support both MPP (Massively Parallel Processing) and SMP (Symmetric Multi-Processing). Each MPP node (or semi-autonomous processing unit) can support SMP.
Teradata can be configured to communicate directly with a mainframe's I/O channel. This is known as channel attached. Alternatively, it can be network attached; that is, configured to communicate via TCP/IP over a LAN.
Channel attached is not always faster than network attached. Similar performance has been observed across a channel attachment as well as a 100 MB LAN. In addition, channel attached requires an additional sequential data move because the data must be moved from the PowerCenter server to the mainframe prior to moving the data across the mainframe channel to Teradata.
Teradata Software
There are Teradata Director Program Ids (TDPIDs), databases and users. The TDPID is the name that is used to connect from a Teradata client to a Teradata server (similar to an Oracle tnsnames.ora entry). Teradata databases and users are somewhat synonymous: a user has a userid, password and space to store tables, while a database is basically a user without a login and password (or, equivalently, a user is a database with a userid and password).
Teradata Access Module Processors (AMPs) are Teradata's parallel database engines. Although they are strictly software ('virtual processors' in NCR terminology), Teradata uses the terms AMP and hardware node interchangeably because, in the past, an AMP was a piece of hardware.
Client Configuration Basics for Teradata
The client side configuration is done using the hosts file (/etc/hosts on UNIX or winnt\system32\drivers\etc\hosts on Windows).
In the hosts file, the name of the Teradata instance (i.e. the TDPID, Teradata Director Program Id) is indicated by the letters and numbers that precede the string cop1 in a hosts file entry.
Example:
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly pcop1
This tells Teradata that when a client tool references the instance demo1099, it should direct requests to localhost (or IP address 127.0.0.1), and when a client tool references instance p, it should direct requests to the server curly (or IP address 192.168.80.113). This entry does not contain any kind of database server specific information (the TDPID is not the same as an Oracle instance ID). That is, the TDPID is used strictly to define the name a client uses to connect to a server. Teradata takes the name you specify, looks in the hosts file to map the cop1 (or cop2, etc.) entry to an IP address, and then attempts to establish a connection with Teradata at that IP address.
There can be multiple entries in a hosts file with similar TDPIDs:
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly_1 pcop1
192.168.80.114 curly_2 pcop2
192.168.80.115 curly_3 pcop3
192.168.80.116 curly_4 pcop4
This setup allows load balancing of clients among multiple Teradata nodes. Most Teradata systems have many nodes, and each node has its own IP address. Without the multiple hosts file entries, every client would connect to the same node, and eventually that node would carry a disproportionate share of the client processing. With multiple hosts file entries, if it takes too long for the node with the cop1 suffix (i.e. curly_1) to respond to the client request to connect to p, the client automatically attempts to connect to the node with the cop2 suffix (i.e. curly_2), and so forth.
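The cop-suffix convention above can be checked mechanically. The following is a minimal sketch that lists each TDPID and the node IP it maps to; the temporary file path and its entries are sample assumptions (a real check would read /etc/hosts):

```shell
# Build a sample hosts file (illustrative entries only).
cat > /tmp/sample_hosts <<'EOF'
127.0.0.1 localhost demo1099cop1
192.168.80.113 curly_1 pcop1
192.168.80.114 curly_2 pcop2
EOF

# For each alias ending in copN, strip the suffix to recover the TDPID
# and print it with the node's IP address.
awk '{ for (i = 2; i <= NF; i++)
         if ($i ~ /cop[0-9]+$/) {
           tdpid = $i
           sub(/cop[0-9]+$/, "", tdpid)
           print tdpid " -> " $1
         } }' /tmp/sample_hosts
```

For the sample file this prints one line per cop entry, showing that TDPID p is spread across two nodes.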
PowerCenter and Teradata
PowerCenter accesses Teradata through several Teradata tools. This section defines each tool and describes how it is configured within PowerCenter.
ODBC
Teradata provides 32-bit ODBC drivers for Windows and UNIX platforms. If possible, use the ODBC driver from Teradata's TTU7 release (or above) of their client software because this version supports array reads. Tests have shown these drivers (3.02) can be 20% - 30% faster than the 3.01 drivers. The Teradata TTU 8.0 release uses ODBC 3.0421. Teradata's ODBC is on a performance par with Teradata's SQL CLI. In fact, ODBC is Teradata's recommended SQL interface for their partners.
16182 https://my-prod.informatica.com/infakb/whitepapers/1/Pages/16182.aspx
1 of 15 6/2/2009 2:07 PM
Do not use ODBC to write to Teradata unless you're writing very small data sets (and even then, you should probably use Tpump instead) because Teradata's ODBC is optimized for query access, not for writing data. So use Teradata ODBC for Teradata sources and lookups. PowerCenter Designer uses Teradata's ODBC to import all Teradata objects (sources, lookups, targets, etc.).
ODBC Windows
Configure the Teradata ODBC driver with the following information on Windows:
ODBC UNIX
When the PowerCenter server is running on UNIX, ODBC is required to read both sources and lookups from Teradata.
As with all UNIX ODBC drivers, the key to configuring the UNIX ODBC driver is adding the appropriate entries to the .odbc.ini file. To correctly configure the .odbc.ini file, there must be an entry under [ODBC Data Sources] that includes the Teradata ODBC driver shared library (tdata.sl on HP-UX, standard shared library extensions on other types of UNIX).
The following example shows the required entries from an actual .odbc.ini file (note the path to the driver may be different on each computer):
[ODBC Data Sources]
dBase=MERANT 3.60 dBase Driver
Oracle8=MERANT 3.60 Oracle 8 Driver
Text=MERANT 3.60 Text Driver
Sybase11=MERANT 3.60 Sybase 11 Driver
Informix=MERANT 3.60 Informix Driver
DB2=MERANT 3.60 DB2 Driver
MS_SQLServer7=MERANT SQLServer driver
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
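A quick way to sanity-check a DSN entry like the one above is to extract its Driver= path and confirm the shared library exists on the machine. This is a sketch against a sample file; the temporary path, DSN name and driver location are assumptions copied from the example:

```shell
# Re-create the example .odbc.ini in a sample location (illustrative only).
cat > /tmp/sample_odbc.ini <<'EOF'
[ODBC Data Sources]
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
EOF

# Pull the Driver= value out of the [TeraTest] section only.
driver=$(awk -F= '/^\[TeraTest\]/ { in_dsn = 1; next }
                  /^\[/           { in_dsn = 0 }
                  in_dsn && $1 == "Driver" { print $2 }' /tmp/sample_odbc.ini)
echo "Driver library: $driver"
[ -f "$driver" ] || echo "warning: $driver not found on this host"
```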
Similar to the client hosts file setup, you can specify multiple IP addresses for the DBCName to balance the client load across multiple Teradata nodes. Consult with a Teradata administrator for exact details on this (or copy the entries from the hosts file on the client machine; refer to the Client Configuration Basics section).
Important note:
Make sure that the DataDirect ODBC path precedes the Teradata ODBC path information in the PATH and SHLIB_PATH (or
LD_LIBRARY_PATH, etc.) environment variables. This is because both sets of ODBC software use some of the same file names.
PowerCenter must use the DataDirect files because this is the software that has been certified.
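A minimal sketch of the required ordering, assuming hypothetical install locations for the two ODBC packages (substitute your real paths):

```shell
# Illustrative install locations -- adjust for your environment.
DATADIRECT_HOME=/opt/odbc/datadirect
TERADATA_HOME=/usr/odbc

# Prepend DataDirect so its files shadow the identically named Teradata files.
PATH="$DATADIRECT_HOME/bin:$TERADATA_HOME/bin:$PATH"
SHLIB_PATH="$DATADIRECT_HOME/lib:$TERADATA_HOME/lib:${SHLIB_PATH:-}"   # LD_LIBRARY_PATH on Solaris/Linux
export PATH SHLIB_PATH

# Quick check that the DataDirect entry really comes first.
case "$PATH" in
  "$DATADIRECT_HOME"/bin:*) echo "DataDirect precedes Teradata: OK" ;;
  *)                        echo "WARNING: check PATH ordering" ;;
esac
```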
Teradata external loaders
PowerCenter supports four different Teradata external loaders:
Tpump
FastLoad
MultiLoad
Teradata Warehouse Builder (TWB)
The actual Teradata loader executables (tpump, mload, fastload and tbuild) must be accessible by the PowerCenter Server application.
All of the Teradata loader connections require a value for the TDPID attribute. Refer to the first section of this document to understand how to correctly enter the value.
All of these loaders require:
A load file (which can be configured to be a stream/pipe and is autogenerated by PowerCenter)
A control file of commands to tell the loader what to do (autogenerated by PowerCenter)
All of these loaders also produce a log file. This log file is the means to debug the loader if something goes wrong. As these are external loaders, all PowerCenter receives back from the loader is whether it ran successfully or not.
By default, the input file, control file and log file are created in the $PMTargetFileDir directory of the PowerCenter Server executing the workflow.
To use any of these loaders, configure the target in the PowerCenter session to be a File Writer and then choose the appropriate loader:
To override the auto-generated control file, click the Pencil icon next to the loader connection name:
Scroll to the bottom of the connection attribute list and click the value next to the Control File Content Override attribute. Then click the down arrow.
Click the Generate button and change the control file as desired. The changed control file is stored in the repository.
Most of the loaders also use some combination of internal Work, Error and Log tables. By default, these will be in the same database as the
target table. All of these can now be overridden in the attributes of the connection.
To write the input flat file that the loaders need to disk, the Is Staged attribute must be checked. If the Is Staged attribute is not set, the file is piped/streamed to the loader.
If you select the non-staged mode for a loader, you should also set the Checkpoint property to '0'. This effectively turns off checkpoint processing. Checkpoint processing is used for recovery/restart of FastLoad and MultiLoad sessions. However, if you are not using a physical file as input, but rather a named pipe, then the recovery/restart mechanism of the loaders does not work. Not only does this impact performance (checkpoint processing is not free, and unnecessary overhead should be eliminated wherever possible), but a non-zero checkpoint value will sometimes cause seemingly random errors and session failures when used with named pipe input (as is the case with streaming mode).
Teradata loader requirements for PowerCenter servers on UNIX
All Teradata load utilities require a non-null standard output and standard error to run properly. Standard output (STDOUT) and standard error (STDERR) are UNIX conventions that determine the default location for a program to write output and error information. When you start the PowerCenter Server without explicitly defining STDOUT and STDERR, both point to the current terminal session. If you log out of UNIX, UNIX redirects STDOUT and STDERR to /dev/null (i.e. a placeholder that discards anything written to it). At this point, Teradata loader sessions will fail because they do not permit STDOUT and STDERR to be /dev/null. Therefore, you must start the PowerCenter Server as follows (from the PowerCenter installation directory):
./pmserver ./pmserver.cfg > ./pmserver.out 2>&1
This starts the PowerCenter Server using the pmserver.cfg configuration file and points STDOUT and STDERR to the file pmserver.out. In this way, STDERR and STDOUT remain defined even after the terminal session logs out.
Important note:
There are no spaces in the token 2>&1. This tells UNIX to redirect STDERR to the same place as STDOUT.
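The effect of the redirection can be verified with any command that writes to both streams. A minimal sketch, using an illustrative output file name:

```shell
# "2>&1" duplicates STDERR onto whatever STDOUT currently points at, so it
# must come after the "> file" redirection. /tmp/sample.out is illustrative.
{ echo "written to stdout"; echo "written to stderr" 1>&2; } > /tmp/sample.out 2>&1
cat /tmp/sample.out
```

Both lines end up in the file, which is exactly what the pmserver start command relies on.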
As an alternative to this method, you can specify the console output file name in the pmserver.cfg file. That is, information written to standard output and standard error will go to the file specified as follows:
ConsoleOutputFilename=
With this entry in the pmserver.cfg file, you can start the PowerCenter Server normally (i.e. ./pmserver).
Partitioned Loading
With PowerCenter, if you set a round robin partition point on the target definition and set each target instance to be loaded using the same loader connection instance, PowerCenter automatically writes all data to the first partition and starts only one instance of FastLoad or MultiLoad.
You will know you are getting this behavior if you see the following entry in the session log:
MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.
If you do not see this message, then chances are the session fails with the following error:
WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.
Tpump
Tpump is an external loader that supports inserts, updates, upserts, deletes and data driven updates. Multiple Tpump instances can execute simultaneously against the same table because Tpump does not use many resources or require table-level locks. It is often used to trickle load a table. As stated earlier, it is a faster way to update a table than ODBC, but it is not as fast as the other loaders.
MultiLoad
This is a sophisticated bulk load utility and is the primary method PowerCenter uses to load/update mass quantities of data into Teradata.
Unlike bulk load utilities from other vendors, MultiLoad supports inserts, updates, upserts, deletes and data driven operations in PowerCenter.
You can also use variables and embed conditional logic into MultiLoad scripts. It is very fast (millions of rows in a few minutes). It can be
resource intensive and will take a table lock.
Cleaning up after a failed MultiLoad
MultiLoad puts the target table into the MultiLoad state. Upon successful completion, the target table is returned to the Normal (non-MultiLoad) state. Therefore, when a MultiLoad fails for any reason, the table is left in the MultiLoad state, and you cannot simply re-run the same MultiLoad; MultiLoad will report an error. In addition, MultiLoad also queries the target table's MultiLoad log table to see if it contains any errors. If a MultiLoad log table exists for the target table, you will also be unable to rerun your MultiLoad job.
To recover from a failed MultiLoad, you must release the target table from the MultiLoad state and also drop the MultiLoad log table. You can
do this using BTEQ or QueryMan to issue the following commands:
drop table mldlog_<table name>;
release mload <table name>;
Note:
The drop table command assumes that you are recovering from a MultiLoad script generated by PowerCenter (PowerCenter always names the MultiLoad log table mldlog_<table name>). If you're working with a hand-coded MultiLoad script, the name of the MultiLoad log table could be anything.
Here is the actual text from a BTEQ session which cleans up a failed load to the table td_test owned by the user infatest:
BTEQ -- Enter your DBC/SQL request or BTEQ command:
drop table infatest.mldlog_td_test;

drop table infatest.mldlog_td_test;
*** Table has been dropped.
*** Total elapsed time was 1 second.

BTEQ -- Enter your DBC/SQL request or BTEQ command:
release mload infatest.td_test;

release mload infatest.td_test;
*** Mload has been released.
*** Total elapsed time was 1 second.
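Because the two cleanup statements follow a fixed pattern, they can be generated from the table name and piped into BTEQ. A minimal sketch, assuming the PowerCenter mldlog_ naming convention; the table name is illustrative:

```shell
# Generate the MultiLoad cleanup statements for a db.table name.
mload_cleanup_sql() {
  db_table="$1"             # e.g. infatest.td_test
  db="${db_table%.*}"       # database portion
  tbl="${db_table##*.}"     # table portion
  echo "drop table ${db}.mldlog_${tbl};"
  echo "release mload ${db_table};"
}

mload_cleanup_sql infatest.td_test
# To run it for real (requires the Teradata client tools):
#   mload_cleanup_sql infatest.td_test | bteq
```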
Using one instance of MultiLoad to load multiple tables
MultiLoad can require a large amount of resources on a Teradata system. Some systems will have hard limits on the number of concurrent
MultiLoad sessions allowed. By default, PowerCenter will start an instance of MultiLoad for every target file. To use a single instance of
MultiLoad to load multiple tables (or to load both inserts and updates into the same target table) the generated MultiLoad script file must be
edited.
Note:
This should not be an issue with Tpump because Tpump is not as resource intensive as MultiLoad (and multiple concurrent instances of Tpump can target the same table).
Here's a workaround:
1. Use a dummy session (i.e. set test rows to 1 and target a test database) to generate MultiLoad control files for each of the targets.
2. Merge the multiple control files (one per target table) into a single control file (one for all target tables).
3. Configure the session to call MultiLoad from a post-session script using the control file created in step 2. Integrated support cannot be used because each input file is processed sequentially, and this causes problems when combined with PowerCenter's integrated named pipes and streaming.
Details on merging the control files:
1. There is a single log file for each instance of MultiLoad, so you do not have to change or add anything in the LOGFILE statement. However, you might want to change the name of the log table since it may be a log that spans multiple tables.
2. Copy the work and error table delete statements into the common control file.
3. Modify the BEGIN MLOAD statement to specify all the tables that the MultiLoad will be hitting.
4. Copy the Layout sections into the common control file and give each a unique name. Organize the file such that all the Layout sections are grouped together.
5. Copy the DML sections into the common control file and give each a unique name. Organize the file such that all the DML sections are grouped together.
6. Copy the Import statements into the common control file and modify them to reflect the unique names created for the referenced LAYOUT and DML sections in steps 4 and 5. Organize the file such that all the Import sections are grouped together.
7. Run chmod -w on the newly created control file so PowerCenter does not overwrite it, or name it something different so PowerCenter cannot overwrite it.
Note:
A single instance of MultiLoad can target at most 5 tables.
Therefore, do not combine more than 5 target files into a common file.
Example:
Here's an example of a control file merged from two default control files:
.DATEFORM ANSIDATE;
.LOGON demo1099/infatest,infatest;
.LOGTABLE infatest.mldlog_TD_TEST;
DROP TABLE infatest.UV_TD_TEST ;
DROP TABLE infatest.WT_TD_TEST ;
DROP TABLE infatest.ET_TD_TEST ;
DROP TABLE infatest.UV_TD_CUSTOMERS ;
DROP TABLE infatest.WT_TD_CUSTOMERS ;
DROP TABLE infatest.ET_TD_CUSTOMERS ;
.ROUTE MESSAGES WITH ECHO TO FILE c:\LOGS\TgtFiles\td_test.out.ldrlog ;
.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
ERRLIMIT 1
CHECKPOINT 10000
TENACITY 10000
SESSIONS 1
SLEEP 6;
/* Begin Layout Section */
.Layout InputFileLayout1;
.Field CUST_KEY 1 CHAR(12) NULLIF CUST_KEY = '*' ;
.Field CUST_NAME 13 CHAR(20) NULLIF CUST_NAME = '*' ;
.Field CUST_DATE 33 CHAR(10) NULLIF CUST_DATE = '*' ;
.Field CUST_DATEmm 33 CHAR(2);
.Field CUST_DATEdd 36 CHAR(2);
.Field CUST_DATEyyyy 39 CHAR(4);
.Field CUST_DATEtd CUST_DATEyyyy||'/'||CUST_DATEmm||'/'||CUST_DATEdd NULLIF CUST_DATE = '*' ;
.Filler EOL_PAD 43 CHAR(2) ;
.Layout InputFileLayout2;
.Field CUSTOMER_KEY 1 CHAR(12);
.Field CUSTOMER_ID 13 CHAR(12);
.Field COMPANY 25 CHAR(50) NULLIF COMPANY = '*' ;
.Field FIRST_NAME 75 CHAR(30) NULLIF FIRST_NAME = '*' ;
.Field LAST_NAME 105 CHAR(30) NULLIF LAST_NAME = '*' ;
.Field ADDRESS1 135 CHAR(72) NULLIF ADDRESS1 = '*' ;
.Field ADDRESS2 207 CHAR(72) NULLIF ADDRESS2 = '*' ;
.Field CITY 279 CHAR(30) NULLIF CITY = '*' ;
.Field STATE 309 CHAR(2) NULLIF STATE = '*' ;
.Field POSTAL_CODE 311 CHAR(10) NULLIF POSTAL_CODE = '*' ;
.Field PHONE 321 CHAR(30) NULLIF PHONE = '*' ;
.Field EMAIL 351 CHAR(30) NULLIF EMAIL = '*' ;
.Field REC_STATUS 381 CHAR(1) NULLIF REC_STATUS = '*' ;
.Filler EOL_PAD 382 CHAR(2) ;
/* End Layout Section */
/* begin DML Section */
.DML Label tagDML1;
INSERT INTO infatest.TD_TEST (CUST_KEY,CUST_NAME,CUST_DATE) VALUES (:CUST_KEY,:CUST_NAME,:CUST_DATEtd) ;
.DML Label tagDML2;
INSERT INTO infatest.TD_CUSTOMERS (CUSTOMER_KEY,CUSTOMER_ID,COMPANY,FIRST_NAME,LAST_NAME,ADDRESS1,ADDRESS2,CITY,STATE,POSTAL_CODE,PHONE,EMAIL,REC_STATUS) VALUES (:CUSTOMER_KEY,:CUSTOMER_ID,:COMPANY,:FIRST_NAME,:LAST_NAME,:ADDRESS1,:ADDRESS2,:CITY,:STATE,:POSTAL_CODE,:PHONE,:EMAIL,:REC_STATUS) ;
/* end DML Section */
/* Begin Import Section */
.Import Infile c:\LOGS\TgtFiles\td_test.out
Layout InputFileLayout1
Format Unformat
Apply tagDML1;

.Import Infile c:\LOGS\TgtFiles\td_customers.out
Layout InputFileLayout2
Format Unformat
Apply tagDML2;
/* End Import Section */
.END MLOAD;
.LOGOFF;
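As a sanity check against the 5-table limit, the TABLES clause of a merged control file can be counted mechanically before running it. A minimal sketch over an abbreviated sample file; the file name and contents are illustrative assumptions:

```shell
# Abbreviated merged control file (illustrative).
cat > /tmp/merged.ctl <<'EOF'
.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
ERRLIMIT 1
EOF

# Count the comma-separated table names on the BEGIN ... MLOAD TABLES line.
n=$(awk '/\.BEGIN IMPORT MLOAD TABLES/ {
           sub(/.*TABLES[ \t]*/, "")
           print split($0, t, ",")
         }' /tmp/merged.ctl)
echo "tables in control file: $n"
[ "$n" -le 5 ] && echo "within the 5-table limit"
```

Note this assumes the TABLES clause sits on a single line, as in the PowerCenter-generated scripts above.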
FastLoad
As the name suggests, FastLoad is the fastest method to load data into Teradata. However, there is one major restriction: the target table must be empty.
Teradata Warehouse Builder (TWB)
Teradata Warehouse Builder (TWB) is a single utility that was intended to replace FastLoad, MultiLoad, Tpump and FastExport. It was to support a single scripting environment with different modes, where each mode roughly equates to one of the legacy utilities. It was also to support parallel loading (i.e. multiple instances of a TWB client could run and load the same table at the same time, something the legacy loaders cannot do). Unfortunately, NCR/Teradata does not support TWB, and TWB has never been formally released. According to NCR, the release was delayed primarily because of issues with the mainframe version.
Defining primary keys for tables to support updates, upserts and deletes
As with any other database technology, primary keys have to be specified for Teradata tables when doing updates, upserts or deletes using Teradata loaders. Sometimes, however, there are no primary keys defined for the underlying Teradata tables. In this case, primary keys have to be defined in the metadata when target table definitions are imported using Warehouse Designer. The list of primary keys to be used in this definition can be obtained from the Teradata DBA.
If a table has a partition defined, then the key(s) on which the partition has been defined should also be marked as primary key(s) when defining the target table. This information can be obtained either from the DBA or by looking at the table definition scripts.
In the example below, the Customer_ID, Customer_Name and Effective_Date fields have been marked as primary keys even though these are not primary keys in the underlying Teradata table.
Note:
If the primary keys are not defined then any attempt to update, upsert or delete data using Teradata loaders will result in an error.
More Information
Reference
Teradata documentation http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm
Teradata Forum http://www.teradataforum.com
Related Documents
Teradata Frequently Asked Questions
FAQ: What versions of Teradata does PowerCenter support?
Applies To
Database: Teradata
Operating Systems:
Other Software:
Product: PowerCenter
Last Modified Date: 4/10/2009 11:46 PM ID: 16182