Teradata Internals


Primary Index
1. The key to data distribution in Teradata is the PI: it determines where a row will reside.
2. The PI provides the fastest physical path to retrieve the data and is incredibly important to joins.
3. Selecting a proper PI avoids data storage skew.
4. Teradata can hash two very different values and sometimes produce the same row hash. This is called a collision; it is sometimes called a synonym.

Criteria to select the Primary Index column for a given table
1. Identify index candidates that maximize one-AMP operations.
2. Columns most frequently used for access (value and join).
3. Identify index candidates that optimize parallel processing.
4. Columns that provide good distribution.

Unique Primary Index (UPI)
1. Is unique and can't have duplicates; duplicate rows are rejected, and no duplicate-row checking is required.
2. Access via a UPI is always a one-AMP operation.

Non-Unique Primary Index (NUPI)
1. Values for the selected column can be non-unique.
2. Use when the NUPI column is more effective for query access and joins.

Skew Factor
The evenness of a table's data distribution among the AMPs is measured by the skew factor. With a non-unique PI we get duplicate values, and the more duplicates there are, the more rows share the same row hash, so all rows with the same value land on the same AMP. This creates uneven data distribution: one AMP stores more data while the others store less. When the full table is accessed, the AMP holding the most data takes the longest and keeps the other AMPs waiting, which wastes processing capacity.

Hashing
1. Hashing is a mathematical process in which an index value (UPI, NUPI) is converted into a 32-bit row hash value.
2. Teradata takes the Primary Index value and runs it through a hashing algorithm. The output of the hashing algorithm is a 32-bit Row Hash.
3. The 32-bit Row Hash points to a certain spot on the Hash Map, which indicates which AMP will hold the row. This 32-bit Row Hash always remains with the row as part of its Row Identifier (Row ID).
4. The first 16 bits of the Row Hash (the Destination Selection Word) are used to locate an entry in the Hash Map. This entry is called a Hash Map Bucket. The only thing that resides inside a Hash Map Bucket is the number of the AMP where the row will reside. The row, along with its Row Hash, is delivered to that AMP.
5. The AMP then assigns a Uniqueness Value to the Row Hash: 1 if the Row Hash is unique, 2 if it is the second occurrence, 3 if the third, and so on.
6. The 32-bit Row Hash and the 32-bit Uniqueness Value make up the 64-bit Row ID. The Row ID is how tables are sorted on an AMP.

HASH FUNCTIONS
HASHROW: returns the row hash value for a given value
HASHBUCKET: the grouping (bucket) of a specific hash value
HASHAMP: the AMP that is associated with the hash bucket
HASHBAKAMP: the fallback AMP that is associated with the hash bucket

SELECT HASHROW ('Teradata') AS "Hash Value"
     , HASHBUCKET (HASHROW ('Teradata')) AS "Bucket Num"
     , HASHAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Num"
     , HASHBAKAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Fallback Num";

Binary Search
1. When an AMP searches for a row using a Primary Index, it can perform a binary search, because each table is sorted by the Primary Index Row-ID.
2. All Row-IDs are made up of zeros and ones. The AMP can go to the middle of the rows and pick a row; the result is either "too high", "too low" or "got it".
3. If the result is "too low" or "too high", the AMP goes halfway up or down the file and checks again, until it finds the row.

Partitioned Primary Index (PPI)
Data in a PPI table is always distributed by the PI column, then partitioned by the PPI column, and then sorted by row hash. PPIs are best for queries that specify range constraints. The partitioning column can be different from the PI column, but you can NOT have a UNIQUE PRIMARY INDEX on a table that is partitioned by something not included in the Primary Index.

A PPI reduces the number of rows to be processed by using partition elimination; the process of accessing chunks of data along the partitioning attributes is often referred to as partition elimination. A PPI avoids full table scans without the overhead of a secondary index, and allows for instantaneous dropping of old data and rapid addition of newer data. Partitioning doesn't affect distribution; it only affects how each AMP sorts the rows it receives. To handle queries when you partition by a column that is not part of the Primary Index, you can assign a Unique Secondary Index, or you can include the partition column in your SQL.

A partitioned table adds two bytes to every row as part of the Row-ID: the partition number is placed in front of the Row-ID of each row. The combination of partition number, row hash and uniqueness value is called the ROW KEY. Instead of sorting by the Row-ID, rows are first sorted by partition number; in effect, they are sorted by the Row Key. If a table is NOT partitioned, the partition number is simply set to ZERO. When accessing a PPI table, the cylinder index of the AMP is queried to find out on which cylinder the first data block of the accessed partition is located.

For a NUPI, many rows can have the same primary index value and will hash to the same AMP; however, each of these rows could belong to a different partition. If a UPI were allowed without the partitioning column being part of the primary index, any update or insert statement would require Teradata to check each partition to avoid creating duplicates, which is very inefficient from a performance point of view. Likewise, if the primary index does not include the partitioning columns, then each time a primary index access is required, the responsible AMP has to scan all its partitions for this particular primary index value. This is not the case if you include the partitioning columns in the primary index.

When another table (without PPI) is joined with a PPI table on a PI=PI condition, and one of the tables is partitioned, the rows won't be ordered the same way, and the task in effect becomes a set of sub-joins, one for each partition of the PPI table. This type of join is a sliding window join.
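As a sketch of how a PPI is declared and used (the table, columns and date ranges here are illustrative assumptions, not from the source), the common range form uses RANGE_N, and a range predicate on the partitioning column lets the optimizer eliminate partitions:

CREATE TABLE Sales_PPI
( Store_Id  INTEGER
, Sale_Date DATE
, Total_Amt DECIMAL(12,2)
)
PRIMARY INDEX (Store_Id)
PARTITION BY RANGE_N (Sale_Date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
                      EACH INTERVAL '1' MONTH);

-- Only the March partition is scanned on each AMP (partition elimination):
SELECT Store_Id, SUM(Total_Amt)
FROM Sales_PPI
WHERE Sale_Date BETWEEN DATE '2023-03-01' AND DATE '2023-03-31'
GROUP BY 1;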

Types of Partitions

Partitioning with CASE_N assigns each row to a partition based on a list of conditions evaluated against a column, e.g. PRIMARY INDEX (Customer_Number) PARTITION BY CASE_N (Order_Total < ...); a full sketch is given below.

"No Primary Index" option: creates tables without a PI. However, note that if this option is chosen as N and we create a table without an explicit PI, but the table has a UNIQUE or PRIMARY KEY constraint defined, then UNIQUE and PRIMARY KEY take precedence over the 'N' option and the table is created with a unique primary index.
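A generic reconstruction of the CASE_N form (the boundary values and names are illustrative assumptions, since the original example is incomplete):

CREATE TABLE Orders_PPI
( Customer_Number INTEGER
, Order_Total     DECIMAL(12,2)
)
PRIMARY INDEX (Customer_Number)
PARTITION BY CASE_N
( Order_Total <   1000
, Order_Total <  10000
, Order_Total < 100000
, NO CASE
, UNKNOWN );

Rows matching none of the conditions land in the NO CASE partition; rows for which the conditions evaluate to NULL land in the UNKNOWN partition.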

Consider using NoPI tables during the ETL process in cases where Teradata has to do full table scans anyway, such as SQL transformations carried out on each row. For a NoPI table, no hashing or redistribution is needed: after the rows are distributed randomly across the AMPs, the table is ready, and no sorting is needed. Further, because rows are assigned randomly to the AMPs, the data is always distributed evenly across all AMPs and no skewing occurs. This makes loading faster; in effect, only the acquisition phase of the loading utilities is executed. Another advantage of NoPI tables is that records are always appended to the end of the table's data blocks, avoiding the overhead normally caused by sorting the data by row hash into the data blocks. For example, if you INSERT...SELECT huge amounts of rows into a NoPI table, this reduces I/Os significantly compared with primary index tables. NoPI tables being bulk loaded are never skewed. Still, if you INSERT...SELECT from a primary index table into a NoPI table, local copying of the rows is applied. Basically, no primary index tables are not designed to be production tables.
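A minimal NoPI staging-table sketch (names are assumed; note it must be MULTISET, per the restrictions below):

CREATE MULTISET TABLE Stg_Sales
( Sale_Id INTEGER
, Amount  DECIMAL(12,2)
)
NO PRIMARY INDEX;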

There are some further restrictions if you decide to use no primary index tables. Here are the most important:
- MultiLoad is not supported for NoPI tables, as MultiLoad makes use of the PI for its operation
- Only MULTISET tables can be created
- No identity columns can be used
- NoPI tables cannot be partitioned with a PPI
- No statements with an update character are allowed (UPDATE, MERGE INTO, UPSERT); you can still use INSERT, DELETE and SELECT
- No Permanent Journal is possible
- Cannot be defined as Queue Tables
- Update triggers cannot update a NoPI table (probably introduced in a later release)
- No hash indexes are allowed (use join indexes instead)

Although the above restrictions apply to NoPI tables, you can still use the following features as usual:
- Fallback protection of the table
- Secondary Indexes (USI, NUSI)
- Join Indexes
- CHECK and UNIQUE constraints
- Triggers
- Collection of statistics

Secondary Indexes

Secondary Indexes provide an alternate path to the data and should be used for queries that run thousands of times. Teradata runs extremely well without secondary indexes. They require extra perm space to store their subtables, plus overhead for their maintenance.

When a USI is designated on a table, each AMP will build a subtable that points back to the base table. When a Non-Unique Secondary Index (NUSI) is designated on a table, each AMP will also build a subtable. The NUSI subtable is said to be AMP-local, because each AMP creates its secondary index subtable to point to its own base rows. You can have up to 32 secondary indexes on a table. Secondary indexes provide an alternate path to the data and use permanent storage space. Every secondary index defined causes each AMP to create a subtable. USI subtables are hash distributed; USI queries are two-AMP operations. NUSI subtables are AMP-local; NUSI queries are all-AMP operations, but not full table scans. For a NUSI, if an AMP contains duplicate values, only one subtable row is built, holding multiple base-row IDs. Value-ordered NUSIs can be any non-unique index of integer type. Always collect statistics on all NUSI indexes: the PE will decide whether a NUSI is strongly selective and worth using over a full table scan. Use the EXPLAIN facility to see whether a NUSI is being utilized or whether bitmapping is taking place. A secondary index subtable row contains the secondary index value, the secondary index row ID, and the primary index row ID.

Value-Ordered NUSI
When a value-ordered Non-Unique Secondary Index (value-ordered NUSI) is designated on a table, each AMP builds a subtable. In a value-ordered NUSI, instead of the subtable being sorted by the hash of the secondary index value, it is sorted numerically by the SI column. Value-ordered NUSIs are efficient for processing queries with range conditions and inequality conditions on the secondary index column.
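A hedged DDL sketch (table and column names are assumed) showing the three flavors side by side; the value-ordered form follows the same CREATE INDEX ... ORDER BY VALUES pattern used later in this document:

CREATE UNIQUE INDEX (Emp_SSN) ON Employee;             -- USI: two-AMP access path
CREATE INDEX (Dept_No) ON Employee;                    -- NUSI: AMP-local subtable, all-AMP access
CREATE INDEX (Hire_Date) ORDER BY VALUES ON Employee;  -- value-ordered NUSI for range conditions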

Advantages:
- A secondary index can be created and dropped dynamically.
- A table may have up to 32 secondary indexes.
- A secondary index can be created on any column, either unique or non-unique.
- It serves as an alternate access path for less frequently used cases; e.g., defining an SI on a non-indexed column can improve performance if that column is used in a join or filter condition of a given query.

Disadvantages:
- Since subtables have to be created, there is always an overhead of additional space.
- They require additional I/Os to maintain their subtables.
- The Optimizer may, or may not, use a NUSI, depending on its selectivity.
- If the base table is Fallback, the secondary index subtable is Fallback as well.
- If statistics are not collected accordingly, the optimizer will go for a full table scan.

NUSI bitmapping is used when multiple NUSIs are combined with AND conditions. It identifies the common row IDs before retrieving the base table rows.

Joins and Join Indexes in Teradata

Teradata's Optimizer has the ability to interpret a user's join types and then decide on the best join strategy to take in order to complete the query. Basically, joins are combining rows from two or more tables.

The Key Things about Teradata and Joins
- Each AMP holds a portion of a table.
- Teradata uses the Primary Index to distribute the rows among the AMPs.
- Each AMP keeps its tables separated from other tables, like someone might keep clothes in separate dresser drawers.
- Each AMP sorts its tables by Row ID.
- For a join to take place, the two rows being joined must find a way to get to the same AMP.
- If the rows are not naturally on the same AMP, Teradata will use one of two strategies to place them together: redistribute one or both of the tables in spool, or copy the smaller table to all of the AMPs.

Teradata determines the type of join strategy to be used based on the query, with performance in mind. Common join types in Teradata are:
- Inner join (can also be a "self join" in some cases)
- Outer join (Left, Right, Full)
- Exclusion join
- Cross join (Cartesian product join)

Merge Join
Merge join is a concept in which the rows to be joined must be present on the same AMP. If the rows to be joined are not on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that happen, based on the row hash of the columns involved in the join's WHERE clause. If the two tables to be joined have the same Primary Index, then the matching records are already present on the same AMP and redistribution of records is not required.

There are four scenarios in which redistribution can happen for a Merge Join:
- Case 1: If the joining columns are UPI = UPI, the records to be joined are already present on the same AMP and redistribution is not required. This is the most efficient and fastest join strategy (sketched below).
- Case 2: If the joining columns are UPI = non-index column, the records of the second table have to be redistributed across the AMPs based on the data of the first table.
- Case 3: If the joining columns are non-index column = non-index column, both tables have to be redistributed so that the matching data lands on the same AMP, and the join happens on the redistributed data. This strategy is time consuming, since a complete redistribution of both tables takes place across all the AMPs.
- Case 4: For a join on UPI = non-index column, if the referenced table (the second table in the join) is very small, then this table is duplicated/copied to every AMP.
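A sketch of Case 1 (names are assumed): if Cust_Id is the primary index of both tables, matching rows are already co-located, and EXPLAIN shows no redistribution step:

SELECT o.Order_Id, c.Cust_Name
FROM Orders o
JOIN Customers c
  ON o.Cust_Id = c.Cust_Id;   -- assumes Cust_Id is the (U)PI of both tables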

Nested Join
Nested join is one of the most precise join plans suggested by the Optimizer. A nested join works on a UPI/USI used in the join statement and is used to retrieve a single row from the first table. It then checks for matching rows in the second table, using an index (primary or secondary), and returns the matching results.

SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM EMPLOYEE EMP, DEPARTMENT DEP
WHERE EMP.Enum = DEP.Enum
AND EMP.Enum = 2345;   -- this results in a nested join

Hash Join
Hash join is one of the plans suggested by the Optimizer based on the joining conditions. The hash join is a close relative of the merge join in its functionality. As with a merge join, the joining happens on the same AMP, but in a hash join one or both of the tables fit completely inside an AMP's memory: the AMP chooses to hold small tables in its memory for joins happening on the row hash.

Advantages of hash joins:
- They are faster than merge joins, since the large table does not need to be sorted.
- Since the join happens between a table in AMP memory and a table in unsorted spool, it happens very quickly.

The hash join gets its name from the fact that the smaller table is built as a hash table, and potential matching rows from the second table are searched for by hashing against the smaller table. Usually the optimizer first identifies the smaller table and sorts it by the join-column row hash sequence. If the smaller table is really small and fits in memory, performance is best. Otherwise, the sorted smaller table is duplicated to all the AMPs, and the larger table is processed one row at a time, doing a binary search of the smaller table for a match.

Exclusion Join
These types of joins are suggested by the optimizer when the following are used in queries: NOT IN, EXCEPT, MINUS, and SET subtraction operations.

SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM EMPLOYEE EMP
WHERE EMP.Enum NOT IN
( SELECT Enum
  FROM DEPARTMENT DEP
  WHERE Enum IS NOT NULL );

Please make sure to add an additional WHERE filter with IS NOT NULL, since the presence of a NULL in a NOT IN list will return no results.

Product Joins
Product joins compare every row of one table to every row of another table. They are called product joins because they are a product of the number of rows in table one multiplied by the number of rows in table two. For example, if one table has five rows and the other table has five rows, then the product join will compare 5 x 5 or 25 rows, with a potential of 25 rows coming back. To avoid a product join, check your syntax to ensure that the join is based on an equality condition. A product join always results when the join condition is based on inequality; the reason the optimizer chooses product joins for join conditions other than equality is that hash values cannot be compared for greater-than or less-than comparisons. Product joins, merge joins, and exclusion merge joins always require spool files. To implement a product join: identify the smaller table, duplicate it in spool on all AMPs, then join each spool row of the smaller table to every row of the larger table.

Join Processing
Each AMP performs join processing in parallel. The Optimizer chooses the best join strategy based on available indexes and data demographics (collected statistics or dynamic sampling). Rows must be on the same AMP to be matched. Teradata temporarily moves rows to the same AMP if they are not already there for the join; this is called row redistribution.

Join Indexes
A Join Index is an index structure that stores and maintains the results of joining two or more tables.

Join indexes provide a means of improving performance for any type of recurring query that involves joins and/or aggregate functions. A join index pre-joins tables and physically keeps the result on disk. The closest thing to a materialized view in Teradata is the join index. A join index is an index structure that can contain columns from one or more tables. Note that once it is created, it is available only to the optimizer: it is the optimizer that decides whether to use a join index or not, and the index can never be accessed directly by the user. A join index helps in joining tables by providing the needed data from the index itself, and also by avoiding redistribution of data in many cases. Once a join index is created, we do not need to maintain it; the RDBMS does that automatically, meaning that when the base rows change, the join index is changed automatically as well. When creating a join index we specify a primary index; a primary index gets assigned whether or not we explicitly specify one. The primary index is used to redistribute the index rows across the AMPs. The index rows on the AMPs are sequenced by the hash value of the join index's primary index; however, this type of sequencing is not beneficial for range processing, so there is an option to use an ORDER BY clause to override the default sequencing. A join index defined with an outer join covers both inner join queries and outer join queries.

Following are the types of join indexes:
- Multiple-table join index: used to pre-join tables, which can help prevent redistribution of data.
- Single-table join index: used to rehash and redistribute the rows of a single table based on specified columns.
- Aggregate join index: used to create a summary table.
- Sparse join index: a join index that contains a WHERE clause, which reduces the number of rows that would otherwise be included in the index. All types of join indexes (single-table, multi-table, simple or aggregate) can be sparse.
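For instance, a sparse join index might look like this (the names and the date bound are illustrative assumptions):

CREATE JOIN INDEX Recent_Orders_JI AS
SELECT O_OrderKey, O_OrderDate, O_TotalPrice
FROM OrderTbl
WHERE O_OrderDate >= DATE '2023-01-01'   -- the WHERE clause makes the index sparse
PRIMARY INDEX (O_OrderKey);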

CREATE JOIN INDEX OrderByCustomer AS
SELECT departmentname, d.departmentno, employeeid, salary, hiredate
FROM department d
JOIN employee e ON d.departmentno = e.departmentno
PRIMARY INDEX (departmentno);

CREATE INDEX (O_Orderdate) ORDER BY VALUES ON OrderByCustomer;
COLLECT STATISTICS ON OrderByCustomer INDEX (O_Orderdate);

The purpose of a single-table JI is to rehash and redistribute the rows of a table by a column other than the primary index. Assume a scenario where we join two tables, and one of them needs to be redistributed on the join column so that the join can be performed; this would be time consuming if the table is very large. However, we can create a single-table join index on this table, with the redistribution column as the primary index of the join index. The rows are then pre-distributed, so there is no redistribution while performing the join, which speeds up the join.
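A single-table join index sketch (hypothetical names): the rows of Orders are pre-redistributed by Cust_Id, so a join on Cust_Id no longer has to redistribute the base table at query time:

CREATE JOIN INDEX Orders_By_Cust AS
SELECT Cust_Id, Order_Id, Total_Amt
FROM Orders
PRIMARY INDEX (Cust_Id);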

Hash Index
- Hash indexes minimize disk I/Os by offering an alternate access path to the data records.
- A query can be covered if the hash index has all the needed columns. If the index is not covering, the base table rows can still be accessed, as each index row carries the ROWID.
- An HI allows you to define a distribution key, which cannot be done with a secondary index. The columns used for data distribution have to be part of the columns that make up the hash index.
- Maintained automatically by the system, and hence has overhead.

Limitations
- A hash index cannot have a partitioned primary index.
- A hash index cannot have a non-unique secondary index.
- Hash indexes cannot be specified for NoPI or column-partitioned base tables, as they are designed around the Teradata hashing algorithm for data placement (like ROWID pointers).
- A hash index cannot be column partitioned.
- A hash index must have a primary index, but a single-table join index can be created with or without a primary index.

Differences between an HI and a single-table join index
- A hash index cannot have a partitioned primary index, but a single-table join index can.
- A hash index must have a primary index, but a single-table join index can be created without a primary index if it is column-partitioned.

CREATE HASH INDEX HIOrder
(O_CustKey, O_OrderDate, O_TotalPrice)
ON OrderTbl
BY (O_CustKey)
ORDER BY (O_CustKey);

Tables

Global Temporary Tables (GTT)
- When a GTT is created, its definition goes into the Data Dictionary. When materialized, its data goes into temp space.
- Data is active only until the session ends; the definition remains until it is dropped with a DROP TABLE statement. If it is to be dropped from another session, it should be DROP TABLE ... ALL.
- You can collect statistics on a GTT.
- It is used whenever there is a need for a temporary table with the same table definition for all users.

Volatile Temporary Tables (VTT)
- The table definition is stored in the system cache.
- Data is stored in spool space; that is why both the data and the table definition are active only until the session ends.
- No collected statistics for VTTs.
- If you are using a volatile table, you cannot put default values at the column level (while creating the table).
- The LOG option allows a volatile table to use the Transient Journal during transactions.
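Minimal sketches of both definitions (names are assumed), combining the ON COMMIT and LOG options described below:

CREATE GLOBAL TEMPORARY TABLE GT_Sales
( Sale_Id INTEGER
, Amount  DECIMAL(12,2)
) ON COMMIT PRESERVE ROWS;

CREATE VOLATILE TABLE VT_Sales, NO LOG
( Sale_Id INTEGER
, Amount  DECIMAL(12,2)
) ON COMMIT PRESERVE ROWS;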

ON COMMIT { PRESERVE | DELETE } ROWS
LOG | NO LOG

The transient journal maintains a copy of the before-images of all rows affected by a transaction. In the event of transaction failure, the before-images are reapplied to the affected tables and then deleted from the journal, and a rollback operation is completed. In the event of transaction success, the before-images for the transaction are discarded from the journal at the point of transaction commit.

The main difference between the Permanent Journal and the Transient Journal is that the Transient Journal is used to roll back a transaction in case of a failure and is automatic, while the Permanent Journal is used to recover all or some of the database from a specified point in time and is user created.

Fallback protects your data by storing a second copy of each row of a table on an alternative "fallback AMP". If one AMP fails, the system accesses the fallback rows to meet the request. Fallback tables allow users to access data even if one AMP fails.

The purpose of a permanent journal is to maintain a sequential history of all changes made to the rows of one or more tables. Permanent journals help protect user data when users commit, uncommit or abort transactions. A permanent journal can capture a snapshot of rows before a change, after a change, or both. Permanent journaling is usually used to protect data. As with the automatic journal, the contents of a permanent journal remain until you drop them.

The MERGEBLOCKRATIO option provides a way to combine existing small data blocks into a single larger data block during full-table modification operations for permanent tables and permanent journal tables. This option is not available for volatile and global temporary tables. The file system uses the merge block ratio that you specify to reduce the number of data blocks within a table that would otherwise consist mainly of small data blocks.

Data Compression:
Compression in Teradata plays a very important role in saving space and increasing the performance of SQL queries. In Teradata, compression can be implemented in three ways.

Single-Value or Multi-Value Compression (MVC): MVC uses a dictionary to maintain each data value and its corresponding bit pattern. While saving, Teradata replaces the exact value with the bit pattern, occupying much less space. MVC works at the column level and must be defined explicitly for each column for which compression is required. The drawback of MVC is that you have to know in advance which values are expected in the column.

CREATE TABLE EMPLOYEES
( EMP_NAME      CHAR(50) COMPRESS ('RAJ','KEVIN','OBAMA')
, EMP_LAST_DATE DATE COMPRESS
, EMP_DEPT      CHAR(30) COMPRESS ('HR','IT','FS')
) PRIMARY INDEX (EMP_NAME);

Algorithmic Compression (ALC): this type of compression uses an algorithm to compress the data while storing it, and the reverse algorithm to decompress the data when displaying it. Using ALC is a more resource-intensive process.

CREATE TABLE EMPLOYEES
( EMP_NAME      CHAR(50) COMPRESS USING ALGO_NAME DECOMPRESS USING REV_ALGO_NAME
, EMP_LAST_DATE DATE
, EMP_DEPT      CHAR(30)
) PRIMARY INDEX (EMP_NAME);

Block-Level Compression (BLC): this type of compression is used to compress data at the block or table level, not at the column level. Cold data, i.e. data that is not accessed frequently, is ideal for compression using BLC. BLC is a very resource-intensive process and may take some time for compression and decompression; however, the space saving that can be achieved with this method is phenomenal.

Turn BLC on:
SET QUERY_BAND = 'BLOCKCOMPRESSION=YES;' FOR SESSION;

Insert into the empty table:
INSERT INTO EMPLOYEE_BKP SELECT * FROM EMPLOYEE;

Turn BLC off:
SET QUERY_BAND = 'BLOCKCOMPRESSION=NO;' FOR SESSION;

Advantages
- Allows more rows per block
- Reduces the number of I/Os
- Implemented at the column level (MVC)
- Helps I/O-intensive workloads
- The improvement gained through the more-rows-per-block concept is significant in full-table-scan operations
- Compression is transparent to applications

Performance Tuning

Explain

The EXPLAIN facility provides an English-like translation of the plan the SQL optimizer develops to service a request. The execution cost and row counts depend upon statistics. The Teradata optimizer is a cost-based optimizer: it looks for the lowest-cost plan. It does not store the plan but generates it dynamically; as the data demographics change, so may the plan.

Join Preparation: redistribution is needed because join steps are done by the AMPs holding the rows to be joined. You will see something like the following in the explain output for sorting: "sort to order by hash code", "sort to order by row hash", "sort to partition by rowkey", etc.

Row retrieval strategy: you will see something like the following in the explain output for row retrieval: "by way of an all-rows scan", "by way of a rowhash match scan", "by way of the primary index", "by way of the hash value", etc.

Join Type: finally, if the operation is a join operation, the explain output will tell you exactly which kind of join strategy was chosen: "using a product join", "using a single partition hash join", "using a merge join", "using a rowkey-based merge join", etc.

Confidence levels:
- HIGH CONFIDENCE: statistics are available on an index or column.
- JOIN INDEX CONFIDENCE: the join is based on a unique index.
- LOW CONFIDENCE: random sampling of the index. Statistics are not collected, but the WHERE condition is on an indexed column, so estimates can be based on sampling. Also applies when statistics are available but AND/OR clauses are used with non-indexed columns.
- NO CONFIDENCE: statistics are not collected and the condition is on a non-indexed column; estimates are based on a random AMP sample of the row count.

Low and No confidence indicate the need to collect statistics on the indexes or columns involved in the restricting conditions.

Difference between GROUP BY and DISTINCT

DISTINCT
1. Reads each row on the AMP.
2. Hashes the column value identified in the DISTINCT clause of the SELECT statement.
3. Redistributes the rows by that hash value to the appropriate AMPs.
4. Once redistribution is complete, sorts the data to group duplicates on each AMP, removes all the duplicates on each AMP, and sends on the unique values.

P.S.: There are cases when you hit "Error: 2646 No more Spool Space". In such cases, try using GROUP BY.

GROUP BY
1. Reads all the rows that are part of the GROUP BY.
2. Removes all duplicates on each AMP for the given set of values, using a "buckets" concept.
3. Hashes the unique values on each AMP.
4. Redistributes them to the appropriate AMPs.
5. Once redistribution is complete, sorts the data to group duplicates on each AMP, removes all the duplicates on each AMP, and sends on the unique values.

Hence it is better to go for:
- GROUP BY, when there are many duplicates
- DISTINCT, when there are few or no duplicates
- GROUP BY, when SPOOL space is exceeded
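Both forms below return the same unique values; per the guidance above, the GROUP BY form tends to behave better when the column holds many duplicates (names are assumed):

SELECT DISTINCT Dept_No FROM Employee;

SELECT Dept_No FROM Employee GROUP BY Dept_No;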

To include the stats-collection recommendations in the explain plan:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
At the end of the explain text, the recommended statistics for collection will appear as follows:
/* BEGIN RECOMMENDED STATS ->
16) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_SIZE". (HighConf)
17) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_CODE". (HighConf)
18) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_DESC". (HighConf) */

If you want explain to stop showing recommendations for collection of stats, then use the following:
DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;

DIAGNOSTIC HELPSTATS has some drawbacks:
- It does not give any indication of stale stats.
- The stats it recommends should be chosen carefully.
- Care should be taken, since too many stats on a given table can impact the batch running of scripts and increase the overhead of stats maintenance.
- If the recommended stats don't show any improvement in performance, DROP them!

Explain terminology and its meaning:
- "We do a SMS": combining rows using unions
- "BMSMS": NUSI bitmap operation
- "Two-AMP retrieve": selection based on a USI
- "enhanced by dynamic partition elimination": product join with partition elimination
- "Row key based": hash join by partition

DBQL (Database Query Log)

DBQL captures important information about queries that run on your system. With DBQL, you can find out everything from who uses the most CPU and when, to which step in a particular query was skewed and how much CPU each step used, and much more. This information is critical for knowing what is going on with your system, and is even more important for upgrade situations. Several parameters can help in understanding SQL query performance in Teradata: AMPCPUTime, TotalIOCount and SpoolUsage are the three main parameters for determining SQL query performance. Provide the proper privileges to your administrative user for query logging, and determine what type of information you want to collect.

DBQL tables include:
- DBC.DBQLogTbl (default table; core performance data of the query)
- DBC.DBQLSqlTbl (full SQL text of the query)
- DBC.DBQLObjTbl (objects accessed by the query)
- DBC.DBQLStepTbl (step-processing performance of the query)
- DBC.DBQLExplainTbl (explain text of the query)
- DBC.DBQLSummaryTbl (summary construct, typically for tactical queries)

Example:
SET QUERY_BAND = 'Version=1;' FOR SESSION;

SELECT AMPCPUTIME
     , (FIRSTRESPTIME - STARTTIME DAY(2) TO SECOND(6)) AS RUNTIME
     , SPOOLUSAGE / 1024**3 AS SPOOL_IN_GB
     , CAST(100 - ((AMPCPUTIME / (HASHAMP()+1)) * 100 / NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW
     , MAXAMPCPUTIME * (HASHAMP()+1) AS CPU_IMPACT
     , AMPCPUTIME * 1000 / NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM DBC.DBQLOGTBL
WHERE QUERYBAND = 'Version=1;';

The query above gives you detailed insight into how good or bad each step of your query is:
- the total CPU usage
- the spool space needed
- the LHR (ratio between CPU and I/O usage)
- the CPU skew
- the skew impact on the CPU

The goal is to reduce total CPU usage, consumed spool space, and the skew impact on the CPU. Further, the LHR is optimally around 1.00.

You can add or remove columns per your requirements; however, the ones listed above are the important parameters for determining query performance in Teradata. If AMPCPUTIME is high, you have to tune your query to make sure it performs well. Three points to consider while running the query above:
- You may not see results immediately after running your SQL queries: there is a delay of a few minutes before query information arrives in the DBQL tables.
- The query may take some time to produce output. The reason is the not-so-suitable index columns of these two tables. When we check the PRIMARY INDEX columns for both tables, we observe that the PI is the same: both tables have ProcID, CollectTimeStamp as their PI. However, the value of CollectTimeStamp can differ for the same query in the two tables, so joining on that second column is not advisable. You therefore cannot fully leverage the PI here, and the query may take some time to return results.
- To get the SessionID, just run the SEL SESSION; command in the same session in which you are running your queries.

So from now on, never say that the query that took the longest is the worst. Fetch the DBQL stats and determine the worst query yourself.

DBQL Views
- DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement, CPU, I/O, and other fields associated with a particular query.
- DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length, it is split across multiple rows, which is denoted by a column in this table. If you join this to the main query log table, care must be taken if you are aggregating any metrics in the query log table; although more often than not, if you are joining the query log table to the SQL table, you are not doing any aggregation.
- DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables, columns, and indexes referenced by the query. These tables can be joined together in DBC via QueryID and ProcID.

SET QUERY_BAND = 'PROJECT=TeraTuningBlog;TASK=QB_example;' FOR SESSION;

SELECT queryband
     , NumResultRows
     , NumSteps
     , TotalIOCount
     , AMPCPUTime
     , ParserCPUTime
     , NumOfActiveAMPs
     , MaxCPUAmpNumber
     , MinAmpIO
     , MaxAmpIO
     , MaxIOAmpNumber
     , SpoolUsage
FROM dbc.dbqlogtbl
WHERE TRIM(queryband) LIKE '%QUERY1=%' AND queryText LIKE '%SELECT%';

Viewpoint
The Teradata Viewpoint Workload Designer portlet lets users define Active System Management rules (such as filters and throttles) according to which workload is managed.
- Provides systems management via a web browser
- Provides a single operational view for multiple systems
- Highly customizable and can be personalized
- Teradata Management portlets are the replacement for Teradata Manager and PMON
- Teradata Viewpoint provides System Overview, Workload Management, Session Management, Utilities, Application and Node overviews, and Trends
- Teradata Viewpoint shows the session ID, user ID, runtime, expected row count, spool space occupied, and approximate completion time of a query. Viewpoint shows the details of active sessions only.

Statistics
COLLECT STATISTICS scans columns and indexes of a table and records the demographics of the data. COLLECT STATISTICS is used to provide the Teradata Optimizer with as much information about the data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values. Collecting statistics can improve the execution of SQL: the optimizer has more details about each column or index, and can therefore determine a better join plan for resolving the query.

Collecting stats derives the data demographics of the table. These demographics are useful for the optimizer to decide on the execution of a given query, which in turn improves performance. It collects information such as: the total row count of the table, how many distinct values are in a column, how many rows there are per value, whether the column is indexed, and if so, whether the index is unique or non-unique.

Features
- Teradata uses a cost-based optimizer, and cost estimates are made based on statistics. If you don't have statistics collected, the optimizer will use a dynamic AMP sampling method to get the stats. If your table is big and the data is unevenly distributed, dynamic sampling may not get the right information, and your performance will suffer.
- Collected statistics are stored in the DBC.TVFields and DBC.Indexes tables. These two tables cannot be queried directly; instead, run the HELP STATISTICS command on the table, e.g. HELP STATISTICS TABLE_NAME; This gives you the date and time when stats were last collected, along with the stats for the columns (for which stats were defined) of the table.
- Typically, re-collect stats roughly when 10% of the data has changed (by measuring the delta in perm space since the last collection). Recollect stats that have aged 60-90 days (say, when stats were last collected two months ago).
- Collecting stats can be pretty resource-consuming for large tables, so it is always advisable to schedule the job in an off-peak period.
- The optimizer will prefer an FTS over a NUSI when there are no statistics defined on the NUSI columns.

Here are some excellent guidelines on when to collect statistics (example commands follow this list):
- All non-unique indexes
- Non-index join columns
- The primary index of small tables
- The primary index of a join index
- Secondary indexes defined on any join index
- Join index columns that frequently appear in WHERE search conditions
- Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins
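Hedged examples of the commands involved (object names are assumed):

COLLECT STATISTICS ON Employee COLUMN (Dept_No);    -- non-index join/WHERE column
COLLECT STATISTICS ON Employee INDEX (Last_Name);   -- an existing NUSI
HELP STATISTICS Employee;                           -- shows when stats were last collected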

Statistics are especially informative if index values are distributed unevenly. When a query uses conditionals based on non-unique index values, Teradata uses statistics to determine whether indexing or a full scan of all table rows is more efficient. If Teradata determines that indexing is the best method, it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.

Without COLLECT STATISTICS, the Optimizer assumes:
- Non-unique indexes are highly non-unique (lots of rows per value).
- Non-index columns are even more non-unique than non-unique indexes (lots of rows per value).

Teradata derives row counts from a random AMP sample for small tables (fewer than 1000 rows per AMP) and unevenly distributed tables (skewed row distribution due to the PI). A random AMP sample looks at the data of one AMP of the table and, from this, estimates the total number of rows in the table. Random AMP samples may not represent the true total number of rows, because the rows in the table may not be distributed evenly; this often occurs with small tables. As of 9/2000, per table, the random AMP sample uses the same AMP for each SQL query.

Hints:

The columns that are part of a join should be of the same data type (CHAR, INTEGER, ...). Why? When joining columns from two tables, the optimizer checks that the data types match; if not, it will translate the column of the driving table to match that of the derived table, which adds cost.

Do not use functions like SUBSTR, COALESCE, CASE, etc. on the indexes used as part of a join. Why? They add to the cost factor, resulting in performance issues: the optimizer cannot use statistics on columns wrapped in functions, as it is busy converting the functions.

Use NOT NULL wherever possible! The reason is that all the NULL values may get sorted to one poor AMP, resulting in the infamous "NO SPOOL SPACE" error when that AMP cannot accommodate any more NULL values.
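A sketch of the pattern (names are assumed): filtering NULLs out of the join column keeps them from being spooled to a single AMP:

SELECT t.Cust_Id, c.Cust_Name
FROM Transactions t
JOIN Customers c ON t.Cust_Id = c.Cust_Id
WHERE t.Cust_Id IS NOT NULL;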

Optimization Rules

Ensure completeness and correctness of Teradata statistics: use DIAGNOSTIC HELPSTATS ON FOR SESSION and EXPLAIN your SQL statement. At the end of the explain output, a list of statistics will be added which the optimizer would consider helpful in creating a better execution plan. Add them one by one and re-check the execution plan.

The Primary Index (PI) choice: use primary indexes for joins whenever possible, and specify all the columns of the primary indexes in the WHERE clause. Joining on the complete set of primary index columns is the least resource-intensive join possibility.

Teradata indexing techniques: using Teradata indexing techniques may be another option to improve your SQL statement. For example, secondary indexes can be especially helpful if you have highly selective WHERE conditions. You could also try join indexes, or even work with partitioning. Whenever working with indexing techniques, you have to keep the overall data warehouse architecture in mind, and how your solution fits into this architecture.

Query rewriting: many times, query performance can be improved by rewriting the query in a different way. Examples include using DISTINCT instead of GROUP BY on columns with many different values, or using UNION to break a large SQL statement into several smaller ones, which may be executed in parallel.

Real-time monitoring: watch your query running in real time. Observing your query while it is running in Viewpoint or PMON helps you find the critical steps of your query. Most performance issues are caused either by query steps with heavy skew in the AMPs, or by a wrong execution plan caused by stale or missing statistics.

Comparison of resource usage: another very important task in performance optimization is to measure the resources used before and after the optimization. Plain query runtimes can be misleading, as they depend heavily on the current load on the Teradata server, and workload-management blocking may occur without you even noticing. Here is one example query which only needs the DBC.DBQLOGTBL table; set a different QUERYBAND for each version of the query you are running.

Detect Skewing
The PDM (Physical Data Model) is one of the most obvious areas to investigate for skewing problems. A bad primary index choice can cause uneven data distribution and impact query performance.

SELECT TABLENAME
     , SUM(CURRENTPERM) AS CURRENTPERM
     , CAST((100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1
ORDER BY 1;

Detect Teradata skewing by analyzing the joins: during query execution we may have to fight dynamic skew caused by the uneven distribution of spool files. The principle of dynamic skew is simple: whenever a join takes place, the rows to be joined have to be co-located on the same AMP.

Detect Teradata skewing by analyzing column values: while join skew as described above can usually be detected quite easily by analyzing the query and having some common knowledge about the data content, there is another hidden skew risk caused by data demographics: frequent column values in an otherwise evenly distributed table.

Teradata Utilities

Transferring large amounts of data can be done using the various application Teradata utilities that reside on the host computer (mainframe or workstation), i.e. BTEQ, FastLoad, MultiLoad, TPump and FastExport.

BTEQ (Basic Teradata Query) supports all 4 DML statements: SELECT, INSERT, UPDATE and DELETE. BTEQ also supports IMPORT/EXPORT protocols. FastLoad, MultiLoad and TPump transfer data from the host to Teradata. FastExport is used to export data from Teradata to the host.

BTEQ (Basic Teradata Query) was the original way that SQL was submitted to Teradata. It is a Teradata native utility.

BTEQ can be used to submit SQL in either a batch or an interactive environment. BTEQ outputs a report format, whereas Queryman outputs data in a format more like a spreadsheet. BTEQ is also an excellent tool for importing and exporting data. Placing the semicolon at the beginning of the next line (followed by another statement) bundles those statements together as one transaction. BTEQ enables users on a workstation to easily access one or more Teradata Database systems for ad hoc queries, report generation, data movement (suitable for small volumes) and database administration.

BTEQ Modes
- Record Mode (also called DATA mode): set by .EXPORT DATA. This brings data back as a flat file.
- Field Mode (also called REPORT mode): set by .EXPORT REPORT. This is the default mode for BTEQ and brings the data back as if it were a standard SQL SELECT statement; the output returns the column headers for the fields, white space, etc.
- Indicator Mode: set by .EXPORT INDICDATA. This mode writes the data in data mode, but also provides host operating systems with the means of recognizing missing or unknown data (NULL) fields. This is important if the data is to be loaded into another relational database system (RDBMS).
- DIF Mode: Data Interchange Format, which allows users to export data from Teradata to be directly utilized by spreadsheet applications like Excel, FoxPro and Lotus.
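A hedged BTEQ export sketch combining the commands above (the logon string, file and table names are placeholders):

.LOGON tdpid/tduser,tdpassword
.EXPORT DATA FILE = employee.dat
SELECT * FROM Employee;
.EXPORT RESET
.LOGOFF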

Return Code - Description
00 - Job completed with no errors.
02 - User alert to log on to the Teradata DBS.
04 - Warning error.
08 - User error.
12 - Severe internal error.

Override Code - Description
.QUIT 15
.EXIT 15

BTEQ Export: all BTEQ export processes should use the CLOSE option of the EXPORT command and 'SET RETRY OFF' to ensure that the process aborts immediately upon a DBMS restart. If not, the export will reconnect sessions when Teradata is available again, retransmitting rows already sent.
BTEQ Import: all BTEQ import processes populating empty tables should be preceded by a delete of that table, for restartability. Import will not automatically reconnect sessions after a Teradata restart; the job must be manually restarted.

Create the BTEQ Script

To create and edit a BTEQ script, we can use an editor on our client workstation. For example, on a UNIX workstation we can use a text editor:

.SET SESSION TRANSACTION ANSI
.LOGON TDUSER/tdpassword
SELECT emp_name          -- Name of employee of Dept table
     , department_name   -- Name of Department of Dept table
FROM dept;
.QUIT

Save it with the name test.scr.

Step 2:

To Execute The Script:

Start BTEQ, then enter the following BTEQ command to submit a BTEQ script.
Format: .run file = <script file name>

Example: .run file=test.scr

Teradata FastLoad
- Main use: to load empty tables at high speed. The target tables must be empty in order to use FastLoad.
- Supports inserts only; it is not possible to perform updates or deletes in FastLoad.
- Although FastLoad uses multiple sessions to load the data, only one target table can be processed at a time.
- The maximum number of concurrent FastLoad tasks can be adjusted by a system administrator.
- FastLoad runs in two operating modes: interactive and batch.
- An ERRLIMIT count should be specified for every FastLoad job.
- Duplicate rows will not be loaded.

LOGON 127.0.0.1/username,password;
BEGIN LOADING DB.FLOAD_TEST ERRORFILES db1.fload_test_err1, db1.fload_test_err2;
DEFINE in_transno   (INTEGER),
       in_transdate (CHAR(10), NULLIF='0000-00-00'),
       in_accno     (INTEGER),
       in_trans_id  (CHAR(10)),
       in_trans_amt (DECIMAL(12,2))
FILE = TestFloadData;
INSERT INTO DB.FLOAD_TEST
VALUES (:in_transno, :in_transdate (FORMAT 'YYYY-MM-DD'), :in_accno, :in_trans_id, :in_trans_amt);
END LOADING;
LOGOFF;

FastLoad Performance:
CHECKPOINTS: FastLoad provides the capability to issue a checkpoint in the 2nd phase of the load. The checkpoint is an increment of rows being loaded into the table; a checkpoint is issued after each increment of rows. If a FastLoad process gets aborted in the 2nd phase, the FastLoad can be resumed from the last completed checkpoint.

TABLE UPDATE PROCESS: the FastLoad table update process is typically composed of 3 steps:
1) FastLoad: reads the unix records and loads them into a temporary database table.
2) Delete: deletes the rows from the permanent table where the primary indexes of the temporary and permanent tables match.
3) Insert: inserts the rows from the temporary table into the permanent table.

Restrictions
- No secondary indexes are allowed on the target table
- No referential integrity is allowed
- No triggers are allowed at load time
- Duplicate rows (in MULTISET tables) are not supported
- No AMPs may go down (i.e., go offline) while FastLoad is processing
- No more than one data type conversion is allowed per column during a FastLoad

Error and Log Tables

Log Table: FastLoad needs a place to record information on its progress during a load. It uses the table called Fastlog in the SYSADMIN database. This table contains one row for every FastLoad running on the system. In order for your FastLoad to use this table, you need INSERT, UPDATE and DELETE privileges on it.

Empty Target Table: we have already mentioned the absolute need for the target table to be empty.

Two Error Tables: each FastLoad requires two error tables. These will only be populated should errors occur during the load process. They are required by the FastLoad utility, which will automatically create them for you; all you must do is name them. The first error table is for any translation errors or constraint violations. For example, a row with a column containing a wrong data type would be reported to the first error table. The second error table is for errors caused by duplicate values for Unique Primary Indexes (UPIs): FastLoad will load just one occurrence for every UPI, and the other occurrences will be stored in this table. However, if the entire row is a duplicate, FastLoad counts it but does not store the row.

When CHECKPOINT is requested, it allows FastLoad to resume loading from the first row following the last successful CHECKPOINT.

FastExport
FastExport is known for its lightning speed when it comes to exporting vast amounts of data from Teradata and transferring the data into flat files on either a mainframe or a network-attached computer. In addition, FastExport has the ability to use OUTMOD routines, which provide the user the capability to write, select, validate, and preprocess the exported data. A good rule of thumb is that if you have more than half a million rows of data to export to either a flat file format or with NULL indicators, then FastExport is the best choice to accomplish this task. FastExport is extremely attractive for exporting data because it takes full advantage of multiple sessions, which leverages Teradata parallelism. FastExport can also export from multiple tables during a single operation.

How FastExport Works
When FastExport is invoked, the utility logs onto the Teradata database, retrieves the rows that are specified in the SELECT statement, and puts them into SPOOL. From there, it must build blocks to send back to the client. In comparison, BTEQ starts sending rows immediately for storage into a file. If the output data is sorted, FastExport may be required to redistribute the selected data two times across the AMP processors in order to build the blocks in the correct sequence. Remember, a lot of rows fit into a 64K block, and both the rows and the blocks must be sequenced. While all of this redistribution is occurring, BTEQ continues to send rows, and FastExport falls behind in the processing. However, when FastExport starts sending the rows back a block at a time, it quickly overtakes and passes BTEQ's row-at-a-time processing. The other advantage is that if BTEQ terminates abnormally, all of your rows (which are in SPOOL) are discarded, and you must rerun the BTEQ script from the beginning. However, if FastExport terminates abnormally, all the selected rows are in worktables and it can continue sending them where it left off. Pretty smart and very fast!

Restrictions
- FastExport only supports the SELECT statement; it EXPORTS data from Teradata.
- Choose FastExport over BTEQ when exporting more than half a million rows.
- FastExport supports multiple SELECT statements and multiple tables in a single run.
- FastExport supports conditional logic, conditional expressions, arithmetic calculations, and data conversions.
- FastExport does NOT support error files or error limits.
- FastExport supports user-written INMOD and OUTMOD routines, so you can select, validate and preprocess the exported data.

The Teradata RDBMS will only support a maximum of 15 simultaneous FastLoad, MultiLoad, or FastExport utility jobs. This maximum value is determined and configured in the DBS Control record. It can be set from 0 to 15; when Teradata is initially installed, it is set at 5. The reason for this limitation is that FastLoad, MultiLoad, and FastExport all use large blocks to transfer data. If more than 15 simultaneous jobs were supported, a saturation point could be reached on the availability of resources.

FastExport has two modes: RECORD or INDICATOR. In the mainframe world, only use RECORD mode. In the UNIX or LAN environment, INDICATOR mode is the default, but you can use RECORD mode if desired. The difference between the two modes is that INDICATOR mode will set the indicator bits to 1 for column values containing NULLs.
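A skeleton FastExport job for reference (all names are placeholders, and the MODE/FORMAT options mirror the modes described above):

.LOGTABLE db1.fexp_log;
.LOGON tdpid/tduser,tdpassword;
.BEGIN EXPORT SESSIONS 8;
.EXPORT OUTFILE employee.dat MODE INDICATOR FORMAT TEXT;
SELECT Emp_Name, Dept_No
FROM db1.Employee;
.END EXPORT;
.LOGOFF;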

MLoad
- Main use: load, update and delete large tables in Teradata in bulk mode.
- Efficient in loading very large tables.
- Multiple tables can be loaded at a time.
- Updates data in a database in block mode (one physical write can update multiple rows).
- Uses table-level locks.
- Resource consumption: loads at the highest possible throughput.
- Duplicate rows are allowed.
- Can perform DML operations on up to five (5) empty or populated target tables at a time.

Overview
- MultiLoad is faster than BTEQ for updating a populated table: BTEQ updates 1 row at a time, whereas MultiLoad updates blocks of rows at a time.
- When MultiLoad is compared to the FastLoad/delete/insert method, MultiLoad is faster for volumes above 10,000 records. For volumes of less than 10,000 records, the difference is seconds, and negligible.
- MultiLoad's speed is not affected by the number of rows already in the target table. The speed is affected by the number of update records, and can be affected by the number of error records written to the error journals.
- A MultiLoad delete is faster than a normal DELETE command, since the deletion happens in data blocks of 64 Kbytes, whereas the DELETE command deletes data row by row.
- Whenever we define an SI, an SI subtable is created on each AMP. USIs use hash distribution, so the actual data row pointed to by a USI subtable row may not be on the same AMP as the subtable. The AMPs would have to communicate, which is not supported by MultiLoad. For a NUSI, the subtable stores references only to those data rows that exist on the same AMP as the subtable; they all point to data in their own AMP, so the AMPs do not need to communicate. The AMPs work in parallel with a NUSI, and hence MLoad supports NUSIs.
- We can load SET and MULTISET tables using MLoad; but when loading into a MULTISET table using MLOAD, duplicate rows will not be rejected.
- MultiLoad supports the following five format options: BINARY, FASTLOAD, TEXT, UNFORMAT and VARTEXT.

MultiLoad provides two types of operations via modes: IMPORT and DELETE.

MultiLoad IMPORT mode supports up to twenty (20) INSERTs, UPDATEs or DELETEs on up to five target tables. For UPDATEs or DELETEs to be successful in IMPORT mode, they must reference the Primary Index in the WHERE clause.

MultiLoad DELETE mode is used to perform a global (all-AMP) delete on just one table. The reason to use .BEGIN DELETE MLOAD is that it bypasses the Transient Journal (TJ) and can be RESTARTed if an error causes it to terminate prior to finishing. When performing in DELETE mode, the DELETE SQL statement cannot reference the Primary Index in the WHERE clause. This is because a primary index access goes to a specific AMP, whereas this is a global operation. (A minimal DELETE-mode script is sketched below.)
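A minimal DELETE-mode sketch, assuming a hypothetical SQL01.Sales_History table whose Sale_Date column is not part of the Primary Index:

.LOGTABLE SQL01.Del_Log;
.LOGON TDATA/SQL01,SQL0;
.BEGIN DELETE MLOAD TABLES SQL01.Sales_History;
DELETE FROM SQL01.Sales_History
WHERE Sale_Date < DATE '2003-01-01';   /* must NOT reference the Primary Index */
.END MLOAD;
.LOGOFF;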

Restrictions:
1. Unique Secondary Indexes are not supported on a target table. Unlike FastLoad, however, MultiLoad does support Non-Unique Secondary Indexes (NUSIs), because the index subtable row is on the same AMP as the data row.
2. Referential Integrity is not supported.
3. Triggers are not supported at load time.
4. No concatenation of input files is allowed.
5. The host will not process aggregates, arithmetic functions or exponentiation.
6. Import tasks require use of the Primary Index (PI).
7. The MultiLoad utility doesn't support the SELECT statement.

MultiLoad Phases

Preliminary Phase
1. Checks that the SQL syntax and MultiLoad commands are valid.
2. All MultiLoad sessions with Teradata are established. The general rule of thumb for the number of sessions on smaller systems is the number of AMPs plus two more: the first extra session is a control session to handle the SQL and logging, and the second is a backup or alternate for logging. (See the SESSIONS sketch after this list.)
3. The final task of the Preliminary Phase is to apply utility locks to the target tables.

DML Transaction Phase
Teradata's Parsing Engine (PE) parses the DML and generates a step-by-step plan to execute the request.

Acquisition Phase
1. With the PE's plan stored on each AMP, MultiLoad is ready to receive the INPUT data.
2. MultiLoad acquires the data in large, unsorted 64K blocks from the host and sends it to the AMPs.
3. Each receiving AMP hashes each row on the primary index and sends it over the BYNET, but the rows are not yet inserted into the target table.
4. Each AMP puts all of the hashed rows it has received from other AMPs into the worktables.

Application Phase
The purpose of this phase is to write, or APPLY, the specified changes to both the target tables and NUSI subtables. Every hash-sequence-sorted block from the Acquisition Phase and each block of the base table is read only once, to reduce I/O operations and gain speed. Then all matching rows in the base block are inserted, updated or deleted before the entire block is written back to disk, one time.

Clean Up Phase
All empty error tables, worktables and the log table are dropped. All locks, both Teradata and MultiLoad, are released.
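As an illustrative fragment of the sessions rule of thumb (the AMP count of 10 is hypothetical):

.BEGIN IMPORT MLOAD TABLES SQL01.Employee_Dept1
       SESSIONS 12;   /* 10 AMPs + 1 control session + 1 alternate logging session */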

MLoad also uses two error tables (ET and UV), one work table and one log table.

1. ET TABLE - data errors
MultiLoad uses the ET table, also called the Acquisition Phase error table, to store data errors found during the acquisition phase of a MultiLoad import task. It contains constraint violations.

2. UV TABLE - UPI violations
MultiLoad uses the UV table, also called the Application Phase error table, to store data errors found during the application phase of a MultiLoad import or delete task. It contains Unique Primary Index violations.

3. WORK TABLE - WT
MLoad loads the selected records into the work table. The worktables are created in a database using PERM space.

4. LOG TABLE
A log table maintains a record of all checkpoints related to the load job; it is mandatory to specify a log table in an MLoad job. This table is useful in case the job aborts or restarts for any reason.

MLoad Options (see the UPSERT sketch after this list)
1. MARK DUPLICATE INSERT ROWS: logs an entry for all duplicate INSERT rows in the UV error table. Use this when you want to know about the duplicates.
2. IGNORE DUPLICATE INSERT ROWS: tells MultiLoad to IGNORE duplicate INSERT rows because you do not want to see them.
3. MARK DUPLICATE UPDATE ROWS: logs the existence of every duplicate UPDATE row.
4. IGNORE DUPLICATE UPDATE ROWS: eliminates the listing of duplicate update row errors.
5. MARK MISSING UPDATE ROWS: ensures a listing of data rows that had to be INSERTed since there was no row to UPDATE.
6. IGNORE MISSING UPDATE ROWS: tells MultiLoad NOT to list missing UPDATE rows as an error. This is a good option when doing an UPSERT, since UPSERT will INSERT a new row.
7. MARK MISSING DELETE ROWS: makes a note in the ET error table that a row to be deleted is missing.
8. IGNORE MISSING DELETE ROWS: says, "Do not tell me that a row to be deleted is missing."
9. DO INSERT FOR MISSING UPDATE ROWS: required to accomplish an UPSERT. It tells MultiLoad that if the row to be updated does not exist in the target table, then INSERT the entire row from the data source.
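A hedged sketch of an UPSERT .DML label combining these options (reusing the hypothetical SQL01.Employee_Dept1 table and its fields):

.DML LABEL UPSERT_EMP
     IGNORE MISSING UPDATE ROWS
     DO INSERT FOR MISSING UPDATE ROWS;
UPDATE SQL01.Employee_Dept1
   SET Last_Name = :Last_Name
 WHERE Employee_No = :Employee_No;
INSERT INTO SQL01.Employee_Dept1
(Employee_No, Last_Name, Dept_No)
VALUES (:Employee_No, :Last_Name, :Dept_No);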

MLOAD CHECKPOINT
MultiLoad will check the restart Logtable and automatically resume the load process from the last successful CHECKPOINT before the failure occurred. MultiLoad uses neither the Transient Journal nor rollbacks during a failure; that is why you must designate a Logtable at the beginning of your script. The default CHECKPOINT interval is 15 minutes; if you specify a CHECKPOINT value of 60 or less, minutes are assumed.
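For example, as a fragment of a .BEGIN IMPORT MLOAD statement (the interval value is illustrative only):

.BEGIN IMPORT MLOAD TABLES SQL01.Employee_Dept1
       CHECKPOINT 30;   /* 30 is <= 60, so it is interpreted as a checkpoint every 30 minutes */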

/* Simple MLoad script */
.LOGTABLE SQL01.CDW_Log;                       /* sets up a Logtable */
.LOGON TDATA/SQL01,SQL0;                       /* logs on to Teradata */
.BEGIN IMPORT MLOAD TABLES SQL01.Employee_Dept1
       WORKTABLES SQL01.CDW_WT
       ERRORTABLES SQL01.CDW_ET
                   SQL01.CDW_UV;               /* begins the load process by naming the target
                                                  table, work table and error tables; notice
                                                  there is NO comma between the error tables */
.LAYOUT FILEIN;                                /* names the LAYOUT of the INPUT record and
                                                  defines its structure; notice the dots before
                                                  FIELD and FILLER and the semicolon after
                                                  each definition */
.FIELD  Employee_No * CHAR(11);
.FIELD  Last_Name   * CHAR(20);
.FILLER Junk_stuff  * CHAR(100);
.FIELD  Dept_No     * CHAR(6);
.DML LABEL INSERTS                             /* names the DML label */
     DO INSERT FOR MISSING UPDATE ROWS;        /* optional clause */
INSERT INTO SQL01.Employee_Dept1
(Employee_No, Last_Name, Dept_No)
VALUES (:Employee_No, :Last_Name, :Dept_No);   /* tells MultiLoad to INSERT a row into the
                                                  target table and defines the row format;
                                                  lists, in order, the VALUES (each preceded
                                                  by a colon) to be INSERTed */
.IMPORT INFILE CDW_Join_Export.txt
        FORMAT TEXT
        LAYOUT FILEIN
        APPLY INSERTS;                         /* names the import file and its format type,
                                                  cites the LAYOUT to use, and tells MLoad
                                                  to APPLY the INSERTs */
.END MLOAD;                                    /* ends MultiLoad */
.LOGOFF;                                       /* logs off all sessions */
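Such a script is typically run by redirecting it into the utility from the command line, for example (script file name hypothetical):

mload < simple_mload.txt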

Teradata Parallel Data Pump (TPump)
1. Main use: to load or update a small number of target table rows.
2. Sends data to the database as SQL statements, which is much slower than bulk mode. TPump does NOT move data in large blocks; instead, it loads data one row at a time, using row-hash locks.
3. Resource consumption: loading speed can be adjusted using a built-in resource consumption management capability; throughput can be turned down in peak periods (see the .BEGIN LOAD sketch below).
4. TPump does not support MULTISET tables.
5. Can accomplish near real-time updates from source systems into the Teradata data warehouse.
6. Throttle-switch capability: you can throttle the number of updates up and down.
7. FastLoad can only load one table and MultiLoad can load up to five tables, but when it pulls data from a single source, TPump can load more than 60 tables at a time, and the number of concurrent instances in such situations is unlimited.
8. TPump allows both Unique and Non-Unique Secondary Indexes (USIs and NUSIs).
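A hedged sketch of the throttle controls, as a fragment of a TPump .BEGIN LOAD statement (all names and values are illustrative assumptions):

.BEGIN LOAD SESSIONS 4
       ERRORTABLE SQL01.TP_ET
       PACK 20      /* number of statements packed into one request */
       RATE 600;    /* maximum statements sent per minute -- the "throttle" */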

Following are the limitations of the Teradata TPump utility:
1. Use of the SELECT statement is not allowed.
2. Concatenation of data files is not supported.
3. TPump will not process aggregates, arithmetic functions or exponentiation.
4. No more than four IMPORT commands may be used in a single load task.
5. Dates before 1900 or after 1999 must be represented by the yyyy format for the year portion of the date, not the default format of yy.
6. On some network-attached systems, the maximum file size when using TPump is 2GB.
7. TPump performance will be diminished if Access Logging is used.

TPump allows near real-time updates from transactional systems into the data warehouse. It can perform INSERT, UPDATE and DELETE operations, or a combination of them, from the same source. It can be used as an alternative to MLoad for low-volume batch maintenance of large databases. TPump allows target tables to have Secondary Indexes, Join Indexes, Hash Indexes, Referential Integrity and Triggers defined, and the tables may be populated or empty, MULTISET or SET. TPump can have as many sessions as needed, since it has no session limit. TPump uses row-hash locks, thus allowing concurrent updates on the same table.

TPump uses one error table per target table, not two. If you name the table, TPump will create it automatically. Entries are made to this table whenever errors occur during the load process. Like MultiLoad, TPump offers the option to either MARK errors (include them in the error table) or IGNORE errors. The default is to MARK, although when doing an UPSERT this default does not apply. It is the errors that occur while the data is being moved, such as data translation problems, that TPump reports. The error table stores a portion of the actual offending row for debugging.