Friday, March 23, 2007

Db file sequential read while doing full table scan?

These days, we are working on data warehouse in which we have a master table which will have 1.5m (approx) rows inserted every half hour and we have few fast refresh materialized view based on it. These mviews have some aggregate functions on it, which makes it a bit complex.

To start the experiment, each mview refreshes used to take some 18-20 mins, which is totally against the business requirement. Then we tried to figure out on why the mview refresh is taking so much time, in spite of dropping all the bitmap indexes on the mview (generally b-map indexes are not good for inserts/updates).

The 10046 trace (level 12) highlighted that there were many “db file sequential reads” on mview because of optimizer using “I_SNAP$_mview” to fetch the rows from mview and merge the rows with that of master table to make the aggregated data for the mview.

Good part of the story is access to master table was quite fast because we used direct load (using sqlldr direct=y) to insert the data in it. When you use direct load to insert the data, oracle maintains the list of rowids added to table in a view called “SYS.ALL_SUMDELTA”. So while doing fast mview refresh, news rows inserted are picked directly from table using the rowids given from ALL_SUMDELTA view and not from Mview log, so this saves time.

Concerned part was still Oracle using I_SNAP$ index while fetching the data from mview and there were many “db file sequential read” waits and it was clearly visible that Oracle waited on sequential read the most. We figured it out that full table scan (which uses scattered read, and multi block read count) was very fast in comparison to index access by running simple test against table. Also the tables are dependent mviews are only for the day. End of the day the master table and mview’s data will be pushed to historical tables and master table and mviews will be empty post midnight.

I gathered the stats of mview and then re-ran the mview refresh, and traced the session, and this time optimizer didn’t use the index which was good news.

Now the challenge was to run the mview stats gathering job every half an hour or induce wrong stats to table/index to ensure mview refresh never uses index access or may be to lock the stats using DBMS_STATS.LOCK_TABLE_STATS.

But we found another solution by creating the mview with “USING NO INDEX” clause. This way “I_SNAP$” index is not created with “CREATE MATERIALIZED VIEW’ command. As per Oracle the “I_SNAP$” index is good for fast refresh but it proved to be reverse for us because our environment is different and the data changes is quite frequent.

Now, we ran the tests again, we loaded 48 slices of data (24 hrs x 2 times within hour) and the results were above expectations. We could load the data with max 3 mins per load of data.

This is not the end of story. In the trace we could see the mview refresh using “MERGE” command and using full table scan access to mview (which we wanted) and rowid range access to master table.

The explain plan looks like:

Rows     Row Source Operation
-------  ---------------------------------------------------
      2  MERGE  SF_ENV_DATA_MV (cr=4598 pr=5376 pw=0 time=47493463 us)
 263052   VIEW  (cr=3703 pr=3488 pw=0 time=24390284 us)
 263052    HASH JOIN OUTER (cr=3703 pr=3488 pw=0 time=24127224 us)
 263052     VIEW  (cr=1800 pr=1790 pw=0 time=14731732 us)
 263052      SORT GROUP BY (cr=1800 pr=1790 pw=0 time=14205624 us)
 784862       VIEW  (cr=1800 pr=1790 pw=0 time=3953958 us)
 784862        NESTED LOOPS  (cr=1800 pr=1790 pw=0 time=3169093 us)
      1         VIEW  ALL_SUMDELTA (cr=9 pr=0 pw=0 time=468 us)
      1          FILTER  (cr=9 pr=0 pw=0 time=464 us)
      1           MERGE JOIN CARTESIAN (cr=9 pr=0 pw=0 time=459 us)
      1            NESTED LOOPS  (cr=6 pr=0 pw=0 time=99 us)
      1             TABLE ACCESS BY INDEX ROWID OBJ$ (cr=3 pr=0 pw=0 time=56 us)
      1              INDEX UNIQUE SCAN I_OBJ1 (cr=2 pr=0 pw=0 time=23 us)(object id 36)
      1             TABLE ACCESS CLUSTER USER$ (cr=3 pr=0 pw=0 time=40 us)
      1              INDEX UNIQUE SCAN I_USER# (cr=1 pr=0 pw=0 time=7 us)(object id 11)
      1            BUFFER SORT (cr=3 pr=0 pw=0 time=354 us)
      1             INDEX RANGE SCAN I_SUMDELTA$ (cr=3 pr=0 pw=0 time=243 us)(object id 158)
      0           NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us)
      0            INDEX RANGE SCAN I_OBJAUTH1 (cr=0 pr=0 pw=0 time=0 us)(object id 103)
      0            FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us)
      0           FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us)
 784862         TABLE ACCESS BY ROWID RANGE SF_ENV_SLICE_DATA (cr=1791 pr=1790 pw=0 time=2383760 us)
 708905     MAT_VIEW ACCESS FULL SF_ENV_DATA_MV (cr=1903 pr=1698 pw=0 time=6387829 us)

You can see the access pattern as above.

Interesting twist in the story is when I saw the wait events in trace file.

  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      2253        0.74          7.73
  db file scattered read                        240        1.05          6.77
  log file switch completion                 6        0.98          3.16
  log file switch                                    8        0.98          2.47
  rdbms ipc reply                                 6        0.00          0.00
  log buffer space                                3        0.42          0.61

Again, even when we are doing full table scan, there are “db file sequential reads”?

To confirm I opened the raw trace file (before tkprof), and checked the obj# on sequential read wait event, it was the mview (SF_ENV_DATA_MV) !! and there were many. To further investigate I checked if there were any scattered reads to mview or not. I found there were scattered reads but there were many sequential reads also on which Oracle waited more than that of scattered read which did most of the data fetching.

After giving some thought, I realized that we created the mviews without storage clause, which means Oracle created the mview with default storage clause.

So assuming there are 17 blocks in an mview (container table) extent and Multi block read count is 16, Oracle will use scattered read mechanism (multiple blocks) to read the first 16 blocks and for the rest 1 it will use sequential read mechanism (one block), so you will find many sequential reads wait events sandwiched between scattered reads. To overcome this we created the mview with larger extent sizes and also multiple of MBCR (multi block read count).

Also, another cause of sequential read is chained or migrated rows, if your mview (or table) rows are migrated, the pointer to the next row is maintained in old (original) block, which will always be read by a single i/o call i.e. by sequential read.You can check the count of chained rows using DBA_TABLES.CHAIN_CNT after analysing the table . So to overcome this, we created the mview with genuine pctfree so when the merge runs (as a part of mview refresh) and updates few rows, the rows are not moved to a different block and hence avoiding sequential read.


  1. Mview creation with “USING NO INDEX” does not create “I_SNAP$” index which sometimes help in fast refresh when the data changes are quite frequent and you cannot afford to collect stats after every few mins.
  2. Create mview with storage clause suiting to your environment. Default extent sizes may not be always good.
  3. PCTFREE can be quite handy to avoid sequential reads and extra block read.

Saturday, March 3, 2007

Optimizer_mode – ALL_ROWS or FIRST_ROWS?

Out of all Oracle RDBMS modules, optimizer code is actually the most complicated code and different optimizer modes seem like jack while lifting your car in case of a puncture.

This paper focuses on how optimizer behaves differently when you have optimizer mode set to ALL_ROWS or FIRST_ROWS.

Possible values for optimizer_mode = choose/ all_rows/ first_rows/ first_rows[n]

By default, the value of optimizer_mode is CHOOSE which basically means ALL_ROWS (if statistics on underlying tables exist) else RULE (if there are no statistics on underlying tables). So it is very important to have statistics collected on your tables on regular intervals or else you are living in Stone Age.

FIRST_ROWS and ALL_ROWS are both cost based optimizer features. You may use them according to their requirement.


In simple terms it ensures best response time of first few rows (n rows).

This mode is good for interactive client-server environment where server serves first few rows and by the time user scroll down for more rows, it fetches other. So user feels that he has been served the data he requested, but in reality the request is still pending and query is still fetching the data in background.

Best example for this is toad, if you click on data tab, it instantaneously start showing you data and you feel toad is faster than sqlplus, but the fact is if you scroll down, you will see the query is still running.

Ok, let us simulate this on SQLPLUS

Create a table and index over it:

SQL> create table test as select * from all_objects;

Table created.

SQL> create index test_in on test(object_type);

Index created.

SQL> exec dbms_stats.gather_table_stats(‘SAC’,'TEST')

PL/SQL procedure successfully completed.

SQL> select count(*) from test;


SQL> select count(*) from test where object_type='JAVA CLASS';


You see out of almost 38k records, 15k are of JAVA class. And now if you select the rows having object_type=’JAVA_CLASS’, it should not use index as almost half of the rows are JAVA_CLASS. It will be foolish of optimizer to read the index first and then go to table.

Check out the Explain plans

SQL> set autotrace traceonly exp
SQL> select * from test where object_type='JAVA CLASS';

Execution Plan
Plan hash value: 1357081020

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1001 | 94094 | 10 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST | 1001 | 94094 | 10 (0)| 00:00:01 |

As you see above, optimizer has not used Index we created on this table.

Now use FIRST_ROWS hint:

SQL> select /*+ FIRST_ROWS*/ * from test where object_type='JAVA CLASS';

Execution Plan
Plan hash value: 3548301374

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 14662 | 1345K| 536 (1)| 00:00:07 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST | 14662 | 1345K| 536 (1)| 00:00:07 |
|* 2 | INDEX RANGE SCAN | TEST_IN | 14662 | | 43 (3)| 00:00:01 |

In this case, optimizer has used the index.

Q> Why?

Ans> Because you wanted to see first few rows quickly. So, following your instructions oracle delivered you first few rows quickly using index and later delivering the rest.

See the difference in cost, although the response time (partial) of second query was faster but resource consumption was high.

But that does not mean that this optimizer mode is bad. As I said this mode may be good for interactive client-server model. In most of OLTP systems, where users want to see data fast on their screen, this mode of optimizer is very handy.

Important facts about FIRST_ROWS

  1. It gives preference to Index scan Vs Full scan (even when index scan is not good).
  2. It prefers nested loop over hash joins because nested loop returns data as selected (& compared), but hash join hashes one first input in hash table which takes time.
  3. Cost of the query is not the only criteria for choosing the execution plan. It chooses plan which helps in fetching first rows fast.
  4. It may be a good option to use this in an OLTP environment where user wants to see data as early as possible.


In simple terms, it means better throughput

While FIRST_ROWS may be good in returning first few rows, ALL_ROWS ensures the optimum resource consumption and throughput of the query. In other words, ALL_ROWS is better to retrieve the last row first.

In above example while explaining FIRST_ROWS, you have already seen how efficient ALL_ROWS is.

Important facts about ALL_ROWS

  1. ALL_ROWS considers both index scan and full scan and based on their contribution to the overall query, it uses them. If Selectivity of a column is low, optimizer may use index to fetch the data (for example ‘where employee_code=7712’), but if selectivity of column is quite high ('where deptno=10'), optimizer may consider doing Full table scan. With ALL_ROWS, optimizer has more freedom to its job at its best.
  2. Good for OLAP system, where work happens in batches/procedures. (While some of the report may still use FIRST_ROWS depending upon the anxiety level of report reviewers)
  3. Likes hash joins over nested loop for larger data sets.


Cost based optimizer gives you flexibility to choose response time or throughput. So use them based on your business requirement.

Friday, March 2, 2007

Nested loops, Hash join and Sort Merge joins – difference?

Nested loop (loop over loop)

In this algorithm, an outer loop is formed which consists of few entries and then for each entry, and inner loop is processed.


Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;

It is processed like:

For i in (select * from tab1) loop
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;

The Steps involved in doing nested loop are:

a) Identify outer (driving) table

b) Assign inner (driven) table to outer table.

c) For every row of outer table, access the rows of inner table.

In execution plan it is seen like this:


When optimizer uses nested loops?

Optimizer uses nested loop when we are joining tables containing small number of rows with an efficient driving condition. It is important to have an index on column of inner join table as this table is probed every time for a new value from outer table.

Optimizer may not use nested loop in case:

  1. No of rows of both the table is quite high
  2. Inner query always results in same set of records
  3. The access path of inner table is independent of data coming from outer table.

Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user. In case of hash join it is needed and is explained below.

Hash join

Hash joins are used when the joining large tables. The optimizer uses smaller of the 2 tables to build a hash table in memory and the scans the large tables and compares the hash value (of rows from large table) with this hash table to find the joined rows.

The algorithm of hash join is divided in two parts

  1. Build a in-memory hash table on smaller of the two tables.
  2. Probe this hash table with hash value for each row second table

In simpler terms it works like

Build phase

For each row RW1 in small (left/build) table loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;

Probe Phase

For each row RW2 in big (right/probe) table loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;

When optimizer uses hash join?

Optimizer uses has join while joining big tables or big fraction of small tables.

Unlike nested loop, the output of hash join result is not instantaneous as hash joining is blocked on building up hash table.

Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.

Sort merge join

Sort merge join is used to join two independent data sources. They perform better than nested loop when the volume of data is big in tables but not as good as hash joins in general.

They perform better than hash join when the join condition columns are already sorted or there is no sorting required.

The full operation is done in two parts:

  • Sort join operation

get first row RW1 from input1
get first row RW2 from input2.

  • Merge join operation

while not at the end of either input loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
get next row RW2 from input 2
end loop

Note: If the data is already sorted, first step is avoided.

Important point to understand is, unlike nested loop where driven (inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. So they prove to be better than nested loop when the data set is large.

When optimizer uses Sort merge join?

a) When the join condition is an inequality condition (like <, <=, >=). This is because hash join cannot be used for inequality conditions and if the data set is large, nested loop is definitely not an option.

b) If sorting is anyways required due to some other attribute (other than join) like “order by”, optimizer prefers sort merge join over hash join as it is cheaper.

Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join