Nested loop join technique

Nested Loop Join Technique – Part 1 (Table Pre-fetching)

Background

Table pre-fetching was introduced in Oracle 9i and is enabled by default. It improves Nested Loop Join (NLJ) performance by reducing the logical I/O of the query. In 10g we can control this behavior by setting a hidden database parameter (_table_lookup_prefetch_size). Annoyingly, yet another improvement was introduced in 11g, and in that version we have full control of the behavior simply by using SQL hints.

The objective of these test cases is to observe all of those behaviors (normal, table pre-fetching, and the newest one in 11g, table batching) when a query uses NLJ. I am going to compare the performance of unique and non-unique indexes on sorted and unsorted data, so in total there are 4 test cases per batch. In this Part 1 I run the test cases in 10g only (for the normal and table pre-fetching techniques), and I plan to rerun them against 11g in Part 2.

I take Randolf’s exercise as my reference (http://oracle-randolf.blogspot.com/2011/07/logical-io-evolution-part-1-baseline.html). Please go to his blog and read the articles; they explain the topic very well, and I might have missed some parts myself. So if you have time to read them, we can share the knowledge together. For monitoring (statistics, wait events, etc.) I am going to use Snapper version 4 by Tanel Poder (http://blog.tanelpoder.com/2013/02/18/manual-before-and-after-snapshot-support-in-snapper-v4/). Visit his blog as well; this guy is a genius and he has a lot of good stuff.

In his book, Cost-Based Oracle Fundamentals, Jonathan Lewis also looks at the table pre-fetching technique. This is what he explains in the book:



Just to recap, the normal NLJ pseudo-code looks like this:

begin
  for r_outer in (select rows from outer_table where <filter>) loop
    for r_inner in (select rows from inner_table where <matched the join + filter>) loop
      output the selected columns from both tables
    end loop
  end loop
end;

With the code above, the output from the inner table comes back sorted in outer-table order. With the new technique, on the other hand, Oracle does not guarantee that the output will be sorted based on the outer table. I am not too interested in testing this theory, but you can see one example in this blog: http://dioncho.wordpress.com/2010/08/16/batching-nlj-optimization-and-ordering/

The pseudo-code of the new NLJ technique is as follows:

begin
  for r_outer in (select rows from outer_table where <filter>) loop
    for r_inner in (select rows from inner_table where <matched the join + filter>) loop
      get the relevant rowid and put it in ‘list’
    end loop
    walk through the rowid ‘list’ and scan the inner_table once to get all required data;
  end loop
end;
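To make the difference between the two pseudo-code variants concrete, here is a minimal Python sketch. It is only a toy model, not Oracle's implementation: the tiny data set, the `list_size` parameter, and the idea of counting one "table get" per distinct block in each rowid list are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy inner "table": (join_key, block_no, payload). Hypothetical data, 2 rows per block.
Row = Tuple[int, int, str]
INNER: List[Row] = [(k, k // 2, f"row{k}") for k in range(8)]
OUTER: List[int] = list(range(8))  # join keys coming from the outer table


def build_index(inner: List[Row]) -> Dict[int, List[int]]:
    """Stand-in for an index: join key -> list of 'rowids' (positions in inner)."""
    index: Dict[int, List[int]] = {}
    for rowid, (key, _block, _payload) in enumerate(inner):
        index.setdefault(key, []).append(rowid)
    return index


def classic_nlj(outer, inner, index):
    """Visit the table block right away for every rowid returned by the index."""
    output, table_gets = [], 0
    for key in outer:
        for rowid in index.get(key, []):
            table_gets += 1  # one table block visit per rowid
            output.append((key, inner[rowid][2]))
    return output, table_gets


def prefetch_nlj(outer, inner, index, list_size=4):
    """Collect rowids in a list first, then visit each distinct block once per list."""
    output, table_gets = [], 0
    rowid_list = []

    def flush():
        nonlocal table_gets
        # Assumption: one table get per distinct block referenced by the list.
        table_gets += len({inner[rowid][1] for _key, rowid in rowid_list})
        output.extend((key, inner[rowid][2]) for key, rowid in rowid_list)
        rowid_list.clear()

    for key in outer:
        for rowid in index.get(key, []):
            rowid_list.append((key, rowid))
            if len(rowid_list) == list_size:
                flush()
    flush()  # process whatever is left in the list
    return output, table_gets


INDEX = build_index(INNER)
classic_rows, classic_gets = classic_nlj(OUTER, INNER, INDEX)
prefetch_rows, prefetch_gets = prefetch_nlj(OUTER, INNER, INDEX)
print(classic_gets, prefetch_gets)
```

Both functions return the same rows; only the number of simulated table block visits differs, which is the effect the test cases in this article try to measure on a real database.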

Test Recipes

As a starting point, I create 5 tables with 10,000 rows each and exactly 10 rows per block, using the “MINIMIZE RECORDS_PER_BLOCK” command. The purpose is to get round, easy-to-read numbers. In addition to the tables, 4 indexes are created, one on each of the 4 inner tables (every table except DRIVEN). Each index has BLEVEL=2 (I had to use PCTFREE=99 to force it), so the index height is 3 (ROOT, BRANCH, LEAF). Later in these test cases we will create a shorter index to see the impact on the query (logical reads should be smaller as the index gets shorter).

1. DRIVEN, the driving (outer) table. The name should have been DRIVER or DRIVING, but I mistakenly created it as DRIVEN and was already halfway through when I realized it

2. T_UNIQ_SORTED, inner table with a unique index on the ID column and sorted data, to show the normal NLJ

3. T_UNIQ_UNSORTED, inner table with a unique index on the ID column and scattered/randomly ordered data, to show the normal NLJ (created to see the difference between sorted and unsorted data)

4. T_NON_UNIQ_SORTED, inner table with a non-unique index on the ID column and sorted data, to show the new table pre-fetching behavior

5. T_NON_UNIQ_UNSORTED, inner table with a non-unique index on the ID column and scattered/randomly ordered data, to show the new table pre-fetching behavior (created to see the differences between these techniques)



Test Cases and Results

To make a “fair enough” comparison, I follow the steps below in this exercise. The idea is to put as many blocks as possible in the buffer cache to minimize physical I/O. I was too lazy to create an automated script, so I did all these steps manually. Sometimes, due to unwanted load in my VM environment, I had to rerun a test to get good data with acceptable variation.

1. Flush buffer_cache

2. Warm up the buffer by:

   a. Select all data from outer table, DRIVEN (full table scan)

   b. Scan inner table using index access (full index scan)

3. Begin snapper process from separate session

4. Execute each test case (there are 4). Turn on event 10046 to trace SQL wait events and event 10200 to dump consistent gets activity.

5. End snapper process

Below are the scenarios that I prepared and followed to see how the engine does its work. Please check the attached XLS file for the detailed results.

1. Normal NLJ against Unique and Non-Unique index

2. Pre-fetch NLJ against Unique and Non-Unique index

3. Compare the performance of index with BLEVEL=2 and BLEVEL=1

4. Compare the performance of random and sequential data distribution (scattered data)

It’s Number Time

Based on the table and index statistics below, we expect to see around 30,000 consistent gets for the index (since for each of the 10,000 lookups we need to walk from root to branch to leaf to get the rowid), plus either 1,000 consistent gets for the table (assuming Oracle keeps holding the buffer for every 10 consecutive rows) or 10,000 consistent gets (since we have 10,000 rows in the table).

TABLE_NAME                       NUM_ROWS     BLOCKS AVG_ROW_LEN
------------------------------ ---------- ---------- -----------
DRIVEN                              10000       1000         204
T_UNIQ_UNSORTED                     10000       1000         204
T_NON_UNIQ_SORTED                   10000       1000         204
T_UNIQ_SORTED                       10000       1000         204
T_NON_UNIQ_UNSORTED                 10000       1000         204

INDEX_NAME                 CLUSTERING_FACTOR     BLEVEL LEAF_BLOCKS DISTINCT_KEYS
-------------------------- ----------------- ---------- ----------- -------------
T_UNIQ_UNSORTED_IDX                     9993          2       10000         10000
T_NON_UNIQ_UNSORTED_IDX                 9989          2       10000         10000
T_UNIQ_SORTED_IDX                       1000          2       10000         10000
T_NON_UNIQ_SORTED_IDX                   1000          2       10000         10000
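The expectation stated before these statistics is simple arithmetic, and a short Python sketch makes it explicit. The function names are hypothetical, and the model deliberately ignores any buffer pinning Oracle may do, so treat it as a naive baseline rather than a prediction.

```python
def expected_index_gets(probes: int, blevel: int) -> int:
    """One consistent get per index level (root down to leaf) for every probe."""
    height = blevel + 1  # BLEVEL counts the levels above the leaf blocks
    return probes * height


def expected_table_gets(rows: int, rows_per_block: int) -> tuple:
    """Return (best case, worst case) consistent gets for the table.

    Best case: the buffer is held while consecutive rows share a block.
    Worst case: one get per row.
    """
    return rows // rows_per_block, rows


# Figures taken from the statistics above: 10,000 rows, 10 rows per block, BLEVEL=2.
print(expected_index_gets(10_000, 2))   # 30000
print(expected_table_gets(10_000, 10))  # (1000, 10000)
```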

Normal NLJ, Unique and Non-Unique Index

Let’s start with the most basic one. Before we start this test, we need to disable the pre-fetching feature using the command below and bounce the instance. If everything is in place, we should see the execution plans below for both the unique and non-unique versions.

alter system set "_table_lookup_prefetch_size"=0 scope=spfile;

Unique Index

Non-Unique Index

Reading the tkprof output for the unique index version, we see 20,668 consistent gets for the index access, followed by exactly 10,000 for the inner table (T_UNIQ_SORTED). In the non-unique version, we see 30,667 consistent gets for the index access and 10,000 for the inner table (T_NON_UNIQ_SORTED). In addition, we have 1,672 visits to the outer table (DRIVEN). So these facts do not match our expectation. To find out why, we need to enable event 10200 to dump consistent gets.

The output of the event 10200 dump file is provided in the tabular attachment above, and we will look into it to see what happened. Instead of the 30,000 consistent gets for the index that we expected at the beginning, Oracle did only 20,668 (as reported both in the tkprof output and in the event 10200 dump file). In this case Oracle optimizes by pinning the ROOT buffer (only 668 consistent gets out of 10,000, as shown in the rightmost table above). That makes sense, since ROOT and BRANCH are a kind of door or gate into the index data, which lives in the LEAF blocks.

Moving to the table part, here we have an extra 400 consistent gets for T_UNIQ_SORTED (which has 1,000 blocks and 10,000 rows) and an extra 267 for DRIVEN, which is inconsistent with the tkprof output. What I can say from this symptom is that some buffers might be read more than once. But actually we should have 10,000 consistent gets for DRIVEN (despite the fact that it has only 1,000 blocks for 10,000 rows), so the extra 267 can be considered small. And why do we get inconsistent results between the session statistics and the tkprof output? As of now, all I can say is that the tkprof output may be affected by the table and index statistics (a product of Oracle's algorithm). Of course we need to confirm this by hacking the statistics and rerunning a few test cases (I will put it on my list).

Going forward to the non-unique index, we can finally spot the difference of 10,000 consistent gets between the two versions. What is it? We have 19,999 consistent gets for LEAF blocks, which means an additional 10,000 consistent gets. When we look at the consistent get hierarchy table, we see that after Oracle visits the inner table, it goes to the next leaf to check whether that leaf has the same value as the current one. This is an extra job for Oracle when we have a non-unique index: it has to check whether the next leaf has the same value or not. This behavior is not present with the unique index.
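The index figures quoted so far can be reconciled with a few lines of Python. The 668 ROOT gets and the 19,999 LEAF gets are reported in this article; the 10,000 BRANCH gets and the 10,000 LEAF gets for the unique index are my assumptions (one get per probe), chosen because they add up to the reported totals.

```python
ROOT_GETS = 668                 # reported: the root buffer is mostly pinned
BRANCH_GETS = 10_000            # assumption: one branch get per probe
LEAF_GETS_UNIQUE = 10_000       # assumption: one leaf get per probe
LEAF_GETS_NON_UNIQUE = 19_999   # reported in the event 10200 output


def index_gets(root: int, branch: int, leaf: int) -> int:
    """Total consistent gets on the index for one test case."""
    return root + branch + leaf


unique_total = index_gets(ROOT_GETS, BRANCH_GETS, LEAF_GETS_UNIQUE)
non_unique_total = index_gets(ROOT_GETS, BRANCH_GETS, LEAF_GETS_NON_UNIQUE)

print(unique_total)                     # 20668, the tkprof figure for the unique index
print(non_unique_total)                 # 30667, the tkprof figure for the non-unique index
print(non_unique_total - unique_total)  # 9999 extra leaf visits
```

Under these assumptions the whole gap between the two versions comes from the extra leaf visit per probe, which is consistent with the "check the next leaf" behavior described above.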

Here are some other interesting statistics and wait events to compare:

consistent gets - examination: related to unique index access. According to Randolf, this is a “short-cut” version of consistent gets that can reduce the number of latch gets required to access a buffer (I have to rerun this test and monitor the latch activity as well, maybe later).

index fetch by rowid: index unique scan

index scan kdiixs1: index range scan

buffer is (not) pinned count: part of Oracle's optimization to reduce consistent gets

rows fetched via callback: observed only in the unique index scan, but I cannot find further information

table scan blocks gotten: why is it only 1,670 blocks when we have 2 tables with 1,000 blocks each? This is due to the “warm up” activity executed before the NLJ, so some blocks are already in the buffer. This was my first and only guess.

db file sequential read: confirms that during this testing Oracle reloaded a few db blocks into the buffer

Pre-fetching Technique, Unique and Non-Unique Index

In 10g, pre-fetching is enabled by default, but in most cases we can only see this feature with non-unique index access (I could not reproduce pre-fetching output for a unique index scan in this exercise). Starting with 11g, Oracle is able to use the pre-fetching technique for unique index scans, and that is the default behavior (this sounds like good news).

I am going to cover the comparison between the normal and pre-fetching techniques for the non-unique index only, since the outputs for the unique index are similar (please check the XLS file for the details). When we enable the pre-fetching feature, we should see the new execution plan below for the non-unique index scan. The “TABLE ACCESS BY INDEX” step has moved up the plan, outside the “NESTED LOOPS”. What does this mean? To me, it translates as: “instead of going back and forth between index and table to get a rowid and then access the data, Oracle can keep a few rowids in a list, most likely a linked-list structure, and finally use single or multi-block scans against the table”. This new approach reduces the number of consistent gets. Well, let’s look at the numbers for confirmation, as always.

While the consistent gets for the index part remain the same (20,668 from the event 10200 dump file and 30,667 from the tkprof output), the number of consistent gets for the table access, T_NON_UNIQ_SORTED, is significantly reduced, from 10,400 in the previous test case (normally it should be 10,000) to only 1,667. This confirms the theory of the pre-fetching technique: Oracle does not go to the table immediately after getting a rowid from a leaf block.

Besides the “consistent gets” statistics, the comparison of statistics and wait events also shows an improvement in “buffer is pinned count”: Oracle clearly pins more buffers for table blocks. But the value of this statistic is still a mystery to me, since I cannot figure out where it comes from. It would be good to see whether anyone can work out the algorithm or calculation.

Index’s Height

The first myth about the relation between index height and NLJ performance is that the shorter the index, the fewer the consistent gets. So let the numbers confirm it.

For this test case, I have to recreate the indexes with the default PCTFREE. This creates an index with BLEVEL=1 (the index has no BRANCH level) and only about 20 leaf blocks (instead of 10,000 in the previous test case). Since this is a huge difference, we expect to see some improvement in consistent gets as well. Below are the details of the newly created indexes.

INDEX_NAME                     CLUSTERING_FACTOR     BLEVEL LEAF_BLOCKS DISTINCT_KEYS
------------------------------ ----------------- ---------- ----------- -------------
T_UNIQ_SORTED_IDX                           1000          1          20         10000
T_NON_UNIQ_SORTED_IDX                       1000          1          21         10000

Unique Index

In the unique index version, we can see that the consistent gets on the index are reduced by 10,000, since the new index has no BRANCH level. As observed before, the results of the normal and pre-fetching techniques do not differ for the unique index, so this improvement is purely due to the index size (we have a shorter index with height = 2, or BLEVEL=1). Thus we can say that in 10g, when it comes to a unique index, Oracle always assumes (maybe it is hardcoded) that scanning a single index leaf is the most efficient access path. The dump of the new T_UNIQ_SORTED_IDX structure is attached in the table above for your reference; it clearly shows that the number of leaf blocks is 20.

Non-Unique Index

I was shocked when I saw the output for the non-unique index with the pre-fetching feature turned on. The event 10200 output did not capture any ROOT/BRANCH access of the index, and moreover the number of consistent gets for LEAF blocks dropped from 19,999 to 8,632 (a huge improvement indeed). It is interesting to ask where the improvement comes from: is it part of the pre-fetching technique, or is it because we have a shorter index?

The answer is easy to get, since we already know the output of the unique index version. Yes, it is due to the size of the index. Let’s look at the table below for the statistics and wait event comparison (the comparison for the unique index version is also below, but to me nothing in it is interesting).

“consistent gets” is reduced from 42,381 to 34,048 thanks to the pre-fetching optimization (“buffer is pinned count” makes it clear that Oracle pins about twice as many buffers, going from 17,669 to 35,334), and finally “consistent gets” is reduced from 34,048 to 14,735, this time due to the shorter index (“buffer is pinned count” did not change during this test case; it stayed at 35,334). We can also see an improvement in “physical reads”, from 13 to 0. So in this case we can say that small is beautiful, can't we?
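As a quick sanity check on these session-level figures, the Python sketch below computes each saving. All constants are numbers quoted in this article; the observation that the first saving equals the drop in table gets (10,000 down to 1,667) is my own reading of those numbers, not something the trace files state directly.

```python
GETS_NORMAL = 42_381          # non-unique index, pre-fetching off, BLEVEL=2
GETS_PREFETCH = 34_048        # non-unique index, pre-fetching on, BLEVEL=2
GETS_PREFETCH_SHORT = 14_735  # non-unique index, pre-fetching on, BLEVEL=1
TABLE_GETS_NORMAL = 10_000    # table gets reported by tkprof without pre-fetching
TABLE_GETS_PREFETCH = 1_667   # table gets with pre-fetching


def saving(before: int, after: int) -> tuple:
    """Return (absolute saving, saving as a percentage of `before`)."""
    diff = before - after
    return diff, 100.0 * diff / before


prefetch_saving, prefetch_pct = saving(GETS_NORMAL, GETS_PREFETCH)
short_saving, short_pct = saving(GETS_PREFETCH, GETS_PREFETCH_SHORT)

print(prefetch_saving, round(prefetch_pct, 1))
print(short_saving, round(short_pct, 1))
# The pre-fetching saving matches the reduction in table gets exactly.
print(prefetch_saving == TABLE_GETS_NORMAL - TABLE_GETS_PREFETCH)
```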

How about the result for the normal non-unique index version, without pre-fetching? Do we get the buffer optimization as well when the index is shorter? The answer is no (please check the XLS for details), so the table pre-fetching feature is independent of the index size.

Another interesting part is, again, “consistent gets - examination”, which is reduced from 10,001 to 1. Finally, from another book by Jonathan Lewis, “Oracle Core: Essential Internals for DBAs and Developers”, I got a clue (this is another interesting book, and most likely I will not be able to finish reading it), and found this:

So it is related to latch activity, which sadly was missed in this exercise. OK, I will not cover this statistic any further in this exercise, but I have to cover it in Part 2 or later, if I have the time and, more importantly, the willingness to rerun all these test cases against 10g.

Pre-fetching Technique, Scattered vs Sequential Data in Unique and Non-Unique Index

The last test case checks how Oracle handles scattered data. For the unique index, everything looks similar regardless of how scattered the data in the table is. The same is true for the non-unique index when pre-fetching is turned off. So there is nothing special here, and we leave it. Let’s check the non-unique index with pre-fetching turned on.

Moving on to the non-unique index (pre-fetching turned on), the consistent gets increase from 34,048 to 42,369. Here again is the event 10200 output for the non-unique index version (both sorted and unsorted):

non-unique index, pre-fetching is turned on, data is sorted

non-unique index, pre-fetching is turned on, data is unsorted (scattered)

The only difference is the consistent gets from the table: 1,667 for the sorted table and 9,988 for the scattered table. What is this odd 9,988 value? Where does it come from? How about this?

Isn’t it a nice coincidence? To be able to scan the data in sequential (key) order, Oracle needs to jump 9,989 times to a different table block; this is what clustering_factor is all about. So Oracle uses this knowledge as an upper limit for the consistent gets on the table access. Of course the value can be less than the clustering_factor (thanks to buffer optimization), but it should not be more than that. From the “buffer is pinned count” statistic we can see that Oracle did apply the buffer optimization to the scattered data, but not nearly as much as for the sequential data.
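The clustering factor idea can be reproduced with a small Python model: walk the rows in key order and count how often the table block changes. The random shuffle with a fixed seed is only a stand-in for my scattered table, so the exact scattered value will differ from the 9,989 and 9,993 reported by Oracle; the constants at the end are figures quoted in this article.

```python
import random

ROWS, ROWS_PER_BLOCK = 10_000, 10


def clustering_factor(blocks_in_key_order) -> int:
    """Count block changes while walking the rows in index key order."""
    changes, previous = 0, None
    for block in blocks_in_key_order:
        if block != previous:
            changes += 1
            previous = block
    return changes


# Sorted data: key k is stored in block k // 10.
sorted_blocks = [key // ROWS_PER_BLOCK for key in range(ROWS)]

# Scattered data: keys are placed in random slots (hypothetical layout, fixed seed).
slots = list(range(ROWS))
random.Random(42).shuffle(slots)
scattered_blocks = [slot // ROWS_PER_BLOCK for slot in slots]

cf_sorted = clustering_factor(sorted_blocks)
cf_scattered = clustering_factor(scattered_blocks)
print(cf_sorted, cf_scattered)

# Figures quoted in the article for the pre-fetching test on the non-unique index.
TABLE_GETS_SORTED, TABLE_GETS_SCATTERED = 1_667, 9_988
REPORTED_CF_SCATTERED = 9_989
TOTAL_GETS_SORTED, TOTAL_GETS_SCATTERED = 34_048, 42_369

# Table gets stay within the clustering factor, and the whole increase in
# consistent gets is explained by the extra table gets.
print(TABLE_GETS_SCATTERED <= REPORTED_CF_SCATTERED)
print(TOTAL_GETS_SCATTERED - TOTAL_GETS_SORTED == TABLE_GETS_SCATTERED - TABLE_GETS_SORTED)
```

In this model the sorted layout gives one block change per table block, while the shuffled layout changes block on almost every row, which is the same pattern as the CLUSTERING_FACTOR column in the index statistics.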

References

http://hoopercharles.wordpress.com/2011/01/24/watching-consistent-gets-10200-trace-file-parser/

http://oracle-randolf.blogspot.com/2011/07/logical-io-evolution-part-1-baseline.html

http://dioncho.wordpress.com/2010/08/16/batching-nlj-optimization-and-ordering/

http://blog.tanelpoder.com/2013/02/18/manual-before-and-after-snapshot-support-in-snapper-v4/

“Cost-Based Oracle Fundamentals” book

What’s Next?

This article covers only a small part of real-world scenarios. There are a lot of other considerations that need to be tested to get a more detailed understanding of how things work. In this article, we have not talked about how Oracle handles:

1. IOT, Index Organized Table

2. Bitmap Index

3. Global or Local Index

4. Parallelism

5. Anti Join or Semi Join

6. Index Pre-fetching (if such a feature is available)

You can add more points to make this list longer, or you can make it shorter by taking one and doing the exercise. So, will you participate?

-heri-
