Teradata Join Processing


Transcript of Teradata Join Processing

Page 1: Teradata Join Processing

Teradata Join Processing

Center of Excellence

Data Warehousing

Wipro Technologies

Page 2: Teradata Join Processing

Join Processing

Rows to be joined must be on the same AMP.

For join processing, copies of some or all of the rows may have to be moved to a common AMP.

Join plans: product join, merge join, nested join.
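The requirement that joined rows sit on the same AMP can be pictured with a small sketch. It assumes four AMPs and uses `zlib.crc32` as a stand-in hash; Teradata's real row-hash function is different, and `amp_for` is a hypothetical helper, not a Teradata API.

```python
import zlib

N_AMPS = 4  # assumed AMP count, for illustration only


def amp_for(join_value, n_amps=N_AMPS):
    """Map a value to an AMP number via a stand-in hash (not Teradata's real hash)."""
    digest = zlib.crc32(str(join_value).encode("utf-8"))
    return digest % n_amps


# Two rows from different tables that share the join value 600.
table1_row = {"Col1": 100, "Col3": 600}
table2_row = {"Col1": 600}

same_amp = amp_for(table1_row["Col3"]) == amp_for(table2_row["Col1"])
print(same_amp)  # True: equal values always hash to the same AMP
```

Because placement depends only on the hashed value, hashing both tables on the join column is enough to bring matching rows together.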

Page 3: Teradata Join Processing

Join Processing

General scenarios:

Join column is the PI of both tables.

Join column is the PI of one of the tables.

Join column is not the PI of either table.

Page 4: Teradata Join Processing

Case 1 - PI of both tables

Rows taking part in the join are already on the same AMP.

No data movement is necessary.

Rows are already in sorted order (within the block)

This is the best case scenario.

Page 5: Teradata Join Processing

Case 2 - PI of one of the tables

One table has its rows on the target AMP.

Rows of the other table need to be redistributed to their target AMPs by the hash code of the join column value.

If that table is small, the optimizer may instead choose to duplicate it on all AMPs.

Page 6: Teradata Join Processing

Case 3 - not the PI of either table

Rows of both tables need to be redistributed to their target AMPs by the hash code of the join column value.

The optimizer might instead choose to duplicate the smaller table on all AMPs.

This join scenario involves the most data movement.
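The three cases can be summarised in a few lines of code. This is only a sketch of the redistribution path: `tables_to_move` is a hypothetical helper, and it deliberately ignores the alternative of duplicating a small table on all AMPs.

```python
def tables_to_move(join_is_pi_of_t1, join_is_pi_of_t2):
    """Return which tables must be redistributed before an equality join."""
    if join_is_pi_of_t1 and join_is_pi_of_t2:
        return []                      # Case 1: rows are already co-located
    if join_is_pi_of_t1:
        return ["table2"]              # Case 2: only the other table moves
    if join_is_pi_of_t2:
        return ["table1"]              # Case 2, mirrored
    return ["table1", "table2"]        # Case 3: both tables move


print(tables_to_move(True, True))    # []
print(tables_to_move(False, True))   # ['table1']
print(tables_to_move(False, False))  # ['table1', 'table2']
```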

Page 7: Teradata Join Processing

Nested Join

The optimizer chooses this join strategy when the query has:

An equality value for a unique index (UPI or USI) on table 1.

A join on a column of that single row to any index on table 2.

This join uses the fewest system resources.

Nested join cases (table 1 is accessed by "unique index = data value"; its data column is then joined to an index on table 2):

Table 1 access   Table 2 index   AMPs used   Rows returned
UPI              PI              2 AMPs      1 or more rows
USI              PI              3 AMPs      1 or more rows
UPI              USI             3 AMPs      1 row
USI              USI             4 AMPs      1 row
UPI              NUSI            All AMPs    1 or more rows
USI              NUSI            All AMPs    1 or more rows
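The nested join idea, one unique-index lookup on table 1 followed by an index probe on table 2, can be sketched with dictionaries standing in for the indexes. The rows are taken from the example tables later in this deck; the dictionary layout and the `nested_join` helper are assumptions made for illustration, and the table 2 index is modelled as unique, so at most one row comes back.

```python
# Table 1 keyed by its unique index (Col1); values are (Col2, Col3).
table1_by_unique_index = {100: ("P", 600), 200: ("Q", 600), 300: ("R", 700)}
# Table 2 keyed by an index on Col1; values are Col2.
table2_by_index = {600: "P", 700: "Q", 800: "R"}


def nested_join(unique_value):
    """Fetch one table 1 row by unique index, then probe table 2's index."""
    row1 = table1_by_unique_index.get(unique_value)
    if row1 is None:
        return []
    col2, col3 = row1
    match = table2_by_index.get(col3)
    if match is None:
        return []
    return [(unique_value, col2, col3, match)]


print(nested_join(100))  # [(100, 'P', 600, 'P')]
```

Only the rows actually needed are touched, which is why this strategy is so cheap compared with joins that scan or move whole tables.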

Page 8: Teradata Join Processing

Product Join

Most general form of join.

The optimizer chooses a product join under the following conditions:

The WHERE clause is missing.

The join condition is not based on equality.

Join conditions are ORed together.

Table aliases are used incorrectly.

The optimizer determines that it is less expensive than other join types.

Steps:

Identify the smaller table and duplicate it in spool on all AMPs.

Join each spool row of the smaller table to every row of the larger table.
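A product join compares every row of one table with every row of the other, which is why it works for any condition but scales poorly. The sketch below shows this on one AMP with a non-equality condition; the row values and the `product_join` helper are made up for illustration.

```python
def product_join(small_rows, large_rows, condition):
    """Compare every small-table row with every large-table row."""
    comparisons = 0
    result = []
    for big in large_rows:
        for small in small_rows:
            comparisons += 1
            if condition(big, small):
                result.append((big, small))
    return result, comparisons


small = [200, 500]        # stands in for the duplicated spool copy
large = [100, 300, 600]   # rows of the larger table on one AMP
matches, comparisons = product_join(small, large, lambda big, sm: big > sm)
print(comparisons)  # 6: every pairing is examined
print(matches)      # [(300, 200), (600, 200), (600, 500)]
```

The comparison count is always the product of the two row counts, regardless of how many pairs qualify.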

Page 9: Teradata Join Processing

Merge Join

Commonly done when the join conditions are based on equality.

Generally more efficient than a product join because fewer row comparisons are needed.

Steps:

Identify the smaller table.

Put the qualifying rows from one or both tables into spool.

Move the spool rows to the AMPs based on the join column hash (if required).

Sort the spool rows by join column hash value (if necessary).

Compare those rows with matching join column hash values.
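The final comparison step can be sketched as a classic two-pointer merge over inputs sorted by hash. This is a simplified model: rows are `(hash, value)` tuples invented for the example, and a match on hash alone is treated as a join match, whereas a real system would also check the actual column values.

```python
def merge_join(left, right):
    """Join two lists of (hash, payload) tuples, each sorted by hash."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the end of the run of right rows sharing this hash.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == key:
                for k in range(j, j_end):
                    result.append((left[i], right[k]))
                i += 1
            j = j_end
    return result


# The sort corresponds to the "sort the spool rows" step.
left = sorted([(203, "C"), (110, "A"), (111, "B"), (110, "A")])
right = sorted([(210, "D"), (110, "A"), (203, "C"), (120, "B")])
joined = merge_join(left, right)
print(len(joined))  # 3
```

Each input is read once from start to end, so non-matching rows are skipped without being compared against the whole other table.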

Page 10: Teradata Join Processing

Merge Join

Row hash and column values of the two tables being compared:

Row Hash   Col1   Col2...
110        A
110        A
111        B
111        B
203        C
203        C
203        C
110        E

Row Hash   Col1
110        A
120        B
203        C
210        D

Page 11: Teradata Join Processing

Example

Table 1

Col1 (PK)   Col2   Col3 (FK)
100         P      600
200         Q      600
300         R      700
400         S      200
500         T      500
600         X      200
700         Y      300
800         Z      500
900         A      800
1000        B      300
2000        C      300
3000        D      300
4000        E      200

Table 2

Col1 (PK)   Col2...
100         K
200         L
300         M
400         N
500         O
600         P
700         Q
800         R

Page 12: Teradata Join Processing

Example

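Row placement across the AMPs (each pair of lines is one AMP):

Table 1: 100 P 600 | 800 Z 500 | 1000 B 300
Table 2: 100 K | 800 R

Table 1: 400 S 200 | 700 Y 300 | 2000 C 300 | 4000 E 200
Table 2: 400 N | 700 Q

Table 1: 300 R 700 | 600 X 200 | 900 A 800
Table 2: 300 M | 600 P

Table 1: 200 Q 600 | 500 T 500 | 3000 D 300
Table 2: 200 L | 500 O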

Page 13: Teradata Join Processing

Row Distribution Strategy 1

No distribution needed.

No sorting needed.

Join columns of both tables are PIs.

Rows involved in the join are located on the same AMP.

Page 14: Teradata Join Processing

Case 1 - Example

SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.Col1 = t2.Col1

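Row placement across the AMPs (each pair of lines is one AMP); matching Col1 values are already together:

Table 1: 100 P 600 | 800 Z 500 | 1000 B 300
Table 2: 100 K | 800 R

Table 1: 400 S 200 | 700 Y 300 | 2000 C 300 | 4000 E 200
Table 2: 400 N | 700 Q

Table 1: 300 R 700 | 600 X 200 | 900 A 800
Table 2: 300 M | 600 P

Table 1: 200 Q 600 | 500 T 500 | 3000 D 300
Table 2: 200 L | 500 O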

Page 15: Teradata Join Processing

Row Distribution Strategy 2

Distribute and sort one of the tables on the join column row hash.

Join column is the PI of one of the tables.

One of the tables is already distributed on the join column row hash.

The optimizer redistributes the other table and sorts it on the join column row hash.

Page 16: Teradata Join Processing

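Case 2 – Example

SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.Col3 = t2.Col1

Base tables (each pair of lines is one AMP):

Table 1: 100 P 600 | 800 Z 500 | 1000 B 300
Table 2: 100 K | 800 R

Table 1: 400 S 200 | 700 Y 300 | 2000 C 300 | 4000 E 200
Table 2: 400 N | 700 Q

Table 1: 300 R 700 | 600 X 200 | 900 A 800
Table 2: 300 M | 600 P

Table 1: 200 Q 600 | 500 T 500 | 3000 D 300
Table 2: 200 L | 500 O

SPOOL (Table 1 rows redistributed by the row hash of Col3, shown with the Table 2 rows on the same AMP):

Table 1 spool: 900 A 800
Table 2: 100 K | 800 R

Table 1 spool: 300 R 700
Table 2: 400 N | 700 Q

Table 1 spool: 1000 B 300 | 3000 D 300 | 700 Y 300 | 2000 C 300 | 200 Q 600 | 100 P 600
Table 2: 300 M | 600 P

Table 1 spool: 600 X 200 | 400 S 200 | 4000 E 200 | 800 Z 500 | 500 T 500
Table 2: 200 L | 500 O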
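The redistribution in this Case 2 example can be replayed in a few lines. The sketch below replaces hashing with a lookup table read off the AMP groupings in the example; the AMP numbers 0 to 3 and the helper name `redistribute_by_col3` are assumptions made for illustration.

```python
# Which AMP owns each Table 2 Col1 value, read off the example's groupings.
# AMP numbers 0..3 are arbitrary labels.
AMP_OF_VALUE = {100: 0, 800: 0, 400: 1, 700: 1, 300: 2, 600: 2, 200: 3, 500: 3}

TABLE1 = [
    (100, "P", 600), (200, "Q", 600), (300, "R", 700), (400, "S", 200),
    (500, "T", 500), (600, "X", 200), (700, "Y", 300), (800, "Z", 500),
    (900, "A", 800), (1000, "B", 300), (2000, "C", 300), (3000, "D", 300),
    (4000, "E", 200),
]


def redistribute_by_col3(rows, n_amps=4):
    """Build per-AMP spool by sending each row to the AMP owning its Col3 value."""
    spool = [[] for _ in range(n_amps)]
    for row in rows:
        spool[AMP_OF_VALUE[row[2]]].append(row)
    return spool


spool = redistribute_by_col3(TABLE1)
for amp, rows in enumerate(spool):
    print(amp, sorted(rows))
```

After this step every Table 1 row sits on the AMP that already holds the Table 2 row with the matching Col1 value, so each AMP can finish the join locally.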

Page 17: Teradata Join Processing

Row Distribution Strategy 3

Duplicate and sort the smaller table on all AMPs, and locally build and sort the larger table.

The optimizer considers this strategy if it finds that redistributing the larger table is more expensive than duplicating the smaller table.
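The trade-off can be illustrated with a deliberately crude model that counts only rows moved. This is an assumption for illustration, not the optimizer's actual cost formula, which weighs many more factors; all function names are hypothetical.

```python
def rows_moved_by_redistribution(large_row_count):
    """Upper bound: every row of the larger table may be sent to another AMP."""
    return large_row_count


def rows_moved_by_duplication(small_row_count, n_amps):
    """Every AMP receives a full copy of the smaller table."""
    return small_row_count * n_amps


def cheaper_strategy(large_row_count, small_row_count, n_amps):
    """Pick the strategy that moves fewer rows under this simplified model."""
    redistribute = rows_moved_by_redistribution(large_row_count)
    duplicate = rows_moved_by_duplication(small_row_count, n_amps)
    return "duplicate" if duplicate < redistribute else "redistribute"


print(cheaper_strategy(1_000_000, 50, 4))  # duplicate
print(cheaper_strategy(13, 8, 4))          # redistribute
```

Under this model duplication pays off only when the smaller table is tiny relative to the larger one, since its cost grows with the number of AMPs.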

Page 18: Teradata Join Processing

Case 2 – Example

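Base tables (each pair of lines is one AMP):

Table 1: 100 P 600 | 800 Z 500 | 1000 B 300
Table 2: 100 K | 800 R

Table 1: 400 S 200 | 700 Y 300 | 2000 C 300 | 4000 E 200
Table 2: 400 N | 700 Q

Table 1: 300 R 700 | 600 X 200 | 900 A 800
Table 2: 300 M | 600 P

Table 1: 200 Q 600 | 500 T 500 | 3000 D 300
Table 2: 200 L | 500 O

SPOOL (Table 1 built and sorted locally on each AMP; Table 2 duplicated on every AMP):

Table 1 spool: 1000 B 300 | 100 P 600 | 800 Z 500
Table 2 copy: 100 K | 200 L | 300 M | 400 N | 500 O | 600 P | 700 Q | 800 R

Table 1 spool: 400 S 200 | 4000 E 200 | 700 Y 300 | 2000 C 300
Table 2 copy: 100 K | 200 L | 300 M | 400 N | 500 O | 600 P | 700 Q | 800 R

Table 1 spool: 600 X 200 | 300 R 700 | 900 A 800
Table 2 copy: 100 K | 200 L | 300 M | 400 N | 500 O | 600 P | 700 Q | 800 R

Table 1 spool: 3000 D 300 | 500 T 500 | 200 Q 600
Table 2 copy: 100 K | 200 L | 300 M | 400 N | 500 O | 600 P | 700 Q | 800 R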

Page 19: Teradata Join Processing

Row Distribution Strategy 4

Duplicate the smaller table on every AMP.

The optimizer chooses this strategy when the join condition is not based on equality.

Product join scenario.

Page 20: Teradata Join Processing

Explain Facility

Provides an English translation of the steps chosen by the optimizer.

Very helpful to estimate the performance of complex queries.

Helps physical designers in their index selection by providing the execution strategy chosen by the optimizer.

Page 21: Teradata Join Processing

Explaining the EXPLAIN

EXPLAIN output is generally clear and easy to understand, but it contains a few phrases one needs to be familiar with:

“...with no residual conditions...”: No conditions apply other than those used to locate the row.

“...eliminating duplicates...”: A DISTINCT operation is being done.

“...we do a SMS...”: Set manipulations such as UNION or EXCEPT are being done.

“...we do a BMSMS...”: NUSI bit mapping is being used.

“...distributed by hash code to all AMPs...”

“...duplicated on all AMPs...”

Page 22: Teradata Join Processing

Statistics

The optimizer needs demographic information to create the best execution plan for a query:

Number of rows in the table.

Row size.

Number of rows per value.

Index information and demographics.

Based on the statistics, the optimizer estimates the cost and creates the best plan.

Statistics must be collected for the columns and indexes that are accessed frequently.

If statistics are not provided, the optimizer does dynamic sampling (random AMP).
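To make the demographics above concrete, the sketch below computes the row count and rows per value for Col3 of Table 1 from the earlier example. It only illustrates what these figures are; it is not how Teradata collects or stores statistics.

```python
from collections import Counter

# Col3 values of the 13 rows of Table 1 from the example.
TABLE1_COL3 = [600, 600, 700, 200, 500, 200, 300, 500, 800, 300, 300, 300, 200]

row_count = len(TABLE1_COL3)
rows_per_value = Counter(TABLE1_COL3)
distinct_values = len(rows_per_value)
most_common_value, max_rows = rows_per_value.most_common(1)[0]

print(row_count)                    # 13
print(distinct_values)              # 6
print(most_common_value, max_rows)  # 300 4
```

A skewed value such as 300 matters for join planning: redistributing on Col3 sends all four of those rows to the same AMP.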

Page 23: Teradata Join Processing

Questions?