Nested loop (loop over loop)
Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;
The Steps involved in doing nested loop are:
a) Identify outer (driving) table
b) Assign inner (driven) table to outer table.
c) For every row of outer table, access the rows of inner table.
In execution plan it is seen like this:
NESTED LOOPS
outer_loop
inner_loop
When optimizer uses nested loops?
Optimizer uses nested loop when we are joining tables containing small number of rows with an efficient driving condition. It is important to have an index on column of inner join table as this table is probed every time for a new value from outer table.
- No of rows of both the table is quite high
- Inner query always results in same set of records
- The access path of inner table is independent of data coming from outer table.
Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user. In case of hash join it is needed and is explained below.
Hash join
Hash joins are used when the joining large tables. The optimizer uses smaller of the 2 tables to build a hash table in memory and the scans the large tables and compares the hash value (of rows from large table) with this hash table to find the joined rows.
The algorithm of hash join is divided in two parts
- Build a in-memory hash table on smaller of the two tables.
- Probe this hash table with hash value for each row second table
In simpler terms it works like
Build phase
For each row RW1 in small (left/build) table loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;
Probe Phase
For each row RW2 in big (right/probe) table loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;
When optimizer uses hash join?
Optimizer uses has join while joining big tables or big fraction of small tables.
Unlike nested loop, the output of hash join result is not instantaneous as hash joining is blocked on building up hash table.
Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.
Sort merge join
Sort merge join is used to join two independent data sources. They perform better than nested loop when the volume of data is big in tables but not as good as hash joins in general.
They perform better than hash join when the join condition columns are already sorted or there is no sorting required.
The full operation is done in two parts:
- Sort join operation
get first row RW1 from input1
get first row RW2 from input2.
- Merge join operation
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop
Note: If the data is already sorted, first step is avoided.
Important point to understand is, unlike nested loop where driven (inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. So they prove to be better than nested loop when the data set is large.
When optimizer uses Sort merge join?
a) When the join condition is an inequality condition (like <, <=, >=). This is because hash join cannot be used for inequality conditions and if the data set is large, nested loop is definitely not an option.
b) If sorting is anyways required due to some other attribute (other than join) like “order by”, optimizer prefers sort merge join over hash join as it is cheaper.
Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join
42 comments:
Great Job Sachin,
Thank You
Pratheej
Hi,
You have explained the concept so well.
Great work
Thanks.
Well done!
Very clear!
Tks
Hi Sachin,
Thanks a Lot.
Arvind
Hi sachin,
This is very well explained article on join algoriths.
simply great!!!!!
Best wishes....
Amol
good job man!!!
Great and Clear explanation. I've understood all just after first read.
Thank's Sachin
thats really a great explaination
Simply superb !!
Fantabulous Explanation !!!
Tendulkaristic...Sachin
Sachin very good thanx a lot
Awesome explanantion Sachin - Kudos to you!!!
Really great explanation in simple words..........thanks a lot
Cheers sachin, share more stuff with us.
With Oracle 10g its perfectly working as you mentioned like after increasing rownum ,plan changed from "Nested loop" to "Hash join" . But for Oracle 11g plan remains same, means its uses ""Nested loop" even after changing artificial stats .Itd not turn to "Hash Join"
scott>exec dbms_stats.set_table_stats(ownname => 'SCOTT', tabname => 'E', numrows => 1000000, numblks => 10000, avgrlen => 124);
PL/SQL procedure successfully completed.
scott> exec dbms_stats.set_index_stats(ownname => 'SCOTT', indname => 'E_DEPTNO', numrows => 1000000, numlblks => 1000);
PL/SQL procedure successfully completed.
scott>scott>exec dbms_stats.set_table_stats(ownname => 'SCOTT', tabname => 'D', numrows => 1000000,numblks => 10000 , avgrlen => 124);
PL/SQL procedure successfully completed.
scott>scott>select e.ename,d.dname from e, d where e.deptno=d.deptno;
Execution Plan
----------------------------------------------------------
Plan hash value: 3723262532
--------------------------------------------------------------------------------
---------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| T
ime |
--------------------------------------------------------------------------------
---------
| 0 | SELECT STATEMENT | | 250G| 7683G| 1013K (2)| 0
3:22:40 |
| 1 | NESTED LOOPS | | | | |
|
| 2 | NESTED LOOPS | | 250G| 7683G| 1013K (2)| 0
3:22:40 |
| 3 | TABLE ACCESS FULL | D | 1000K| 12M| 2816 (4)| 0
0:00:34 |
|* 4 | INDEX RANGE SCAN | E_DEPTNO | 32 | | 0 (0)| 0
0:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| E | 250K| 4882K| 1 (0)| 0
0:00:01 |
--------------------------------------------------------------------------------
---------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("E"."DEPTNO"="D"."DEPTNO")
Sachin,
Honestly this is the first time ever i understood something so clearly in first reading! You are simply VERY GOOD..you should write a book :-)
in 11G...did you first delete the existing stats and then reset stats artificially on the tables/index and give it a try and please let us know if that makes it work..
Sachin,
You did a tremendously good job in explaining the differences in CBO behaviour based on stats info/clustering info...Thanks!
Majid Khan
Also one more thing, in 11G how is your PGA_TARGET setup, is it large enough to accommodate hash joins? Remember starting from 10G R2 the % usage of PGA per sql/session has changed, i believe the max is 20% of PGA_TARGET. Also check v$pga_stat and see if you are noticing any multipasses...Those were my two cents..:-)
Majid Khan
Sachin,
I got a question..in your test example for nested loops table e will be the driving table (outer loop) and d will be the driven table (inner loop) right? If yes, then you recommended an index on the inner table which is d and no index is created on d...please clarify.
In nested loop example, its "d" (dept) which is driving table and "e" (emp) is driven table. That is why we have index on "e" and not on "d".
Sachin,
By using the approach of artificially populating stats we can simulate production volume of rows/blocks/cluster factoring in the test environment to generate execution plans without in fact having the data volume right? If yes, man this is gooooood...
Thanks
of course we can manipulate the table/index stats to check change in optimizer plan.
In fact, many change the stats in production to "guarantee" a particular type of plan.
Great post Sachin
Thanks a lot sachin, I'm breaking my head about these topics, and you really made it clear...
Thanks a lot again.
Please i need someone to help me, to explain a join index method with example of execution plan of query using index join
Sara, could you explain what do you mean when you refer this term: "join index method"
Thanks a ton for explaining things so clearly.Your approach is awesome, remind me of Tom Kyte.
Thanks Sachin, This is really very nice post
I have deleted the post but i did mention your post as the base article and i wrote on the blog for my own reference but now its according to your will.
great job man, thank you
Great article!
Good Explanation and easy ro understand
This is really very clear. Great work! This post gave some basic idea how the joins will work.
Excellent sachin! keep posting us many more
wonderful explanation, superb post, thanks a lot
Good Post.
As someone mentioned earlier..you should write a book
@Sachin : That is an awsome blog..thanks a lott
Rohit Gholap
Nice Post . it helps me a lot as a starter in this area
Thanks Sachin for the very simple way you used to explain the join methods, it has really helped me a lot.
Great work
Post a Comment