Monday, November 10, 2014

Reverse Key Index: the Remedy Is Worse than the Symptoms

In general, a reverse key index mitigates index block contention (buffer busy waits) when multiple concurrent sessions insert monotonically increasing keys into the right-most index block. The disadvantage is its inability to support index range scans.

Based on AWR reports, this Blog lists the main checking points when using a reverse key index, and shows how to find an acceptable trade-off between benefits and drawbacks. The whole discussion stems from the author's experience with single-instance DBs (non-RAC).
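For reference, a reverse key index is declared with the REVERSE keyword. A minimal sketch with hypothetical table and index names:

 create table t_orders (id number, payload varchar2(100));
 create index t_orders_id_rix on t_orders (id) reverse;

 -- an existing normal B-tree index can be converted with:
 --   alter index <index_name> rebuild reverse;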

1. Top 5 Timed Foreground Events

When some events like:
 enq: TX - index contention
 buffer busy waits
 read by other session

appear among the Top Events, a reverse key index is often hastily resorted to.

However "db file sequential read" should be checked after experimenting reverse key index since all the original inserts of right-most single index block are now spread over many blocks. One should try to balance the fast logical buffer activity against slow physical disk access.

2. Time Model Statistics

Since more blocks are involved with a reverse key index, DBWR is more active, and therefore:
 background elapsed time
 background cpu time

increase.

Higher "db file parallel write" and "DBWR checkpoint buffers written" can also be observed,

3. SQL ordered by Gets

The insert statement generates lower "Buffer Gets" with a reverse key index.

4. SQL ordered by Reads

A reverse key index incurs higher "Physical Reads" for the insert statement, almost 1 "Reads per Exec", since each insert touches a different single block.

Moreover, queries using the index also incur higher "Physical Reads" because conceptually adjacent keys are spread wide over physically isolated blocks.
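The per-execution figures of a suspect statement can be cross-checked in v$sql; a minimal sketch, where the sql_id is taken from the AWR report:

 select sql_id, executions, buffer_gets, disk_reads,
        round(disk_reads / nullif(executions, 0), 2) as reads_per_exec
   from v$sql
  where sql_id = '&sql_id';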

5. Segment Statistics

The absolute figures of buffer contention and physical activity can be extracted by looking at:
 Segments by Logical Reads
 Segments by Row Lock Waits
 Segments by Buffer Busy Waits

 Segments by Physical Reads
 Segments by Physical Writes

Based on them, a quick magnitude comparison of I/O and memory against the system statistics can give an approximate assessment of the reverse key index.
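The same figures are also available outside the AWR snapshot window; a minimal sketch against v$segment_statistics (the index name is hypothetical):

 select statistic_name, value
   from v$segment_statistics
  where owner = 'X' and object_name = 'T_ORDERS_ID_RIX'
    and statistic_name in ('logical reads', 'physical reads', 'physical writes',
                           'buffer busy waits', 'row lock waits');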

Discussion


D-1. Expert Oracle Database Architecture said:

 One way to improve the performance of the PL/SQL implementation with a regular index would be to introduce a small wait. That would reduce the contention on the right-hand side of the index and increase overall performance (Page 439)

 reverse key index (designed for Oracle RAC) (Page 491)


Usually, sequence-based inserts in an OLTP system inherently arrive with a slight delay due to the interactive nature of OLTP, whereas batch processing can be impeded by heavy buffer contention on index blocks.
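A minimal sketch of that "small wait" idea for a batch loader, assuming a hypothetical table and sequence (dbms_lock.sleep requires an execute grant on dbms_lock):

 begin
   for i in 1 .. 100000 loop
     insert into t_orders (id) values (orders_seq.nextval);
     dbms_lock.sleep(0.01);  -- slight delay relieves right-most block contention
   end loop;
   commit;
 end;
 /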

D-2. Introduction To Reverse Key Indexes: Part III (A Space Oddity) and
        Oracle B-Tree Index Internals: Rebuilding The Truth wrote:

If the "right-most" block is filled by the maximum current value in the index, Oracle performs 90-10 block splits meaning that full index blocks are left behind in the index structure.

I prefer to call them 99-1 block splits as 90-10 is misleading.

Generally a reverse key index tends to use 50-50 splits, which cost more block creations and touches.
In principle, it shifts single-block buffer contention to multiple accesses of separate blocks.

D-3. Oracle MOS has a cautious Doc:
        Using Reverse Key Indexes For Oracle 11.2.0.2 (Doc ID 1352477.1)

 However, it must be noted that : Now the entire index had better be in the buffer cache whereas before - only the hot right hand side needed to be in the cache for efficient inserts. If the index cannot fit into the cache we might well have turned a buffer busy wait into a physical IO wait which could be even worse (the remedy is worse than the symptoms).

That memorable last remark was chosen as the title of this Blog.

D-4. Oracle MOS Doc:
 Troubleshooting 'enq: TX - index contention' Waits in a RAC Environment. (Doc ID 873243.1)
explains "enq: TX - index contention":

 The reason for this is the index block splits while inserting a new row into the index. The transactions will have to wait for TX lock in mode 4, until the session that is doing the block splits completes the operations.

and suggests looking at Instance Activity Stats (queried below):
 root node splits
 branch node splits
 leaf node splits
 leaf node 90-10 splits
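
These counters are exposed in v$sysstat; a minimal sketch:

 select name, value from v$sysstat
  where name in ('root node splits', 'branch node splits',
                 'leaf node splits', 'leaf node 90-10 splits');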


In a DB, reverse key indexes can be found by:
 select * from dba_indexes where index_type = 'NORMAL/REV';

Thursday, November 6, 2014

Low High Water Mark, High High Water Mark, and Parallel Query

Recently we saw a parallel SELECT running for 4 hours:

 select /*+  parallel(t,2) parallel_index(t,2) dbms_stats cursor_sharing_exact
   use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad 
   */count(*), count("ID"), count(distinct "ID"),
   sum(sys_op_opnsize("ID")), substrb(dump(min("ID"),16,0,32),1,120),
   substrb(dump(max("ID"),16,0,32),1,120),
   ...
 from "X"."TABLE_1" sample (  5.0000000000)  t;


which was generated when executing:
 dbms_stats.gather_table_stats('X', 'TABLE_1', estimate_percent=>5, degree=>2);
on both Oracle 11.2.0.3.0 and 11.2.0.4.0.

The table is about 160 GB and contains 20m blocks (8K per block) in an ASSM locally managed tablespace; all columns are NUMBER or VARCHAR2.

AWR report shows the SELECT statement statistics:
 Buffer Gets            14,805,983,166
 Physical Reads             40,474,817
 UnOptimized Read Reqs       1,881,624


TABLE_1 Segment Statistics:
 Segments by Logical Reads        16,312,997,632
 Segments by Physical Reads           38,580,663
 Segments by Direct Physical Reads    38,557,265
 Segments by Table Scans                 263,660
 Segments by Buffer Busy Waits            53,492


At first glance, the surprisingly big "Table Scans" figure certainly indicates abnormal behavior.

The huge number of "Buffer Gets" and the doubled "Physical Reads" also explain the slow run.

"Table Scans" for Parallel Query is the Execs of "TABLE ACCESS FULL" below "PX BLOCK ITERATOR" in xplan, since "PX BLOCK ITERATOR" at first breaks up the table into granules which are then dispatched to each PX slaves. In the case of SAMPLE select, it is the Execs of "TABLE ACCESS SAMPLE BY ROWID RANGE".

The number of granules processed by each PX slave is controlled by two hidden parameters:
 _px_min_granules_per_slave: minimum number of rowid range granules to generate per slave default  13
 _px_max_granules_per_slave: maximum number of rowid range granules to generate per slave default 100


For instance, parallel(t,2) contributes a minimum of 26 (2*13) "Table Scans" and a maximum of 200 (2*100);
parallel(t,16) generates a minimum of 208 (16*13) "Table Scans" and a maximum of 1600 (16*100).

Therefore, when increasing the degree from 2 to 16,
 dbms_stats.gather_table_stats(null, 'TABLE_1', estimate_percent=>5, degree=>16);
will take 32 (8*4) hours. That is counterintuitive behavior when using parallel query.

The "Table Scans" can be monitored by:
 select * from v$segment_statistics
  where object_name = 'TABLE_1' and statistic_name = 'segment scans';

 
The hidden parameters can be changed by, for example:
 alter system set "_px_min_granules_per_slave"=19;
 
 
Searching MOS, we found:


 Bug 6474009 - On ASSM tablespaces Scans using Parallel Query can be slower than in 9.2 (Doc ID 6474009.8)


which says:

 Scans using Parallel Query on ASSM tablespaces on 10.2 can be slower compared to 9.2 when there is a wide separation between the High and Low High Water Mark.  There can be High contention and/or a large number of accesses to bitmap blocks in PQ slaves during extent-based scans
 of table/index segments.

 AWR reports may show "read by other session" as one of the top wait events and Section "Buffer Wait Statistics" may report "1st level bmb" as the most important wait class.

 The difference between the High and Low High Water Mark can be identified by running procedure DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM with parameter check_only=1.


The AWR report confirms the above note:
 read by other session  Waits=11,302,432
 1st level bmb          Waits=98,112


and calling:
 DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM('X','TABLE_1','TABLE',NULL,1) returns 1.
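
A minimal sketch of that check call, assuming the same positional signature as above, so the return value becomes visible:

 set serveroutput on
 declare
   l_ret number;
 begin
   l_ret := dbms_space_admin.assm_segment_synchwm('X', 'TABLE_1', 'TABLE', null, 1);
   dbms_output.put_line('check_only=1 returned: ' || l_ret);  -- 1 signals a gap
 end;
 /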

MOS also gives a Workaround:
   Execute procedure DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM with check_only=0 to synchronize the gap between the High and Low High Water Mark.
   IMPORTANT: A patch for bug 6493013 must be installed before running that procedure.

  
DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM only signals the existence of the gap,
but does not report its size. However, the width between the low HWM and the HWM
determines the performance of a full table scan (see Section "Segment Space and the High Water Mark" later in this Blog).

To find the exact gap size, we can dump segment header by:
 alter system dump datafile 94 block 2246443; 
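
The datafile and block numbers used above are those of the segment header, which can be looked up by:

 select header_file, header_block
   from dba_segments
  where owner = 'X' and segment_name = 'TABLE_1';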

Here is the output:
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 1242809 #blocks: 19884944
                  last map  0x9d178f31  #maps: 2446   offset: 2720 
      Highwater::  0x9d179f90  ext#: 1242808 blk#: 16     ext size: 16   
  #blocks in seg. hdr's freelists: 0    
  #blocks below: 19804642
  mapblk  0x9d178f31  offset: 441  
                   Unlocked
  --------------------------------------------------------
  Low HighWater Mark :
      Highwater::  0x1e243af9  ext#: 448588 blk#: 16     ext size: 16   
  #blocks in seg. hdr's freelists: 0    
  #blocks below: 7177417
  mapblk  0x1e242fd9  offset: 225  
  Level 1 BMB for High HWM block: 0x9d179910
  Level 1 BMB for Low HWM block: 0x1aace0d9
  --------------------------------------------------------


We can see there are 19,804,642 blocks below Highwater 0x9d179f90, but only 7,177,417 below the Low HighWater Mark 0x1e243af9. So 12,627,225 blocks (19804642 - 7177417) lie between them (64%).

The distance between the Level 1 BMB for the High HWM block and the one for the Low HWM block is 2,188,032,055
(0x9d179910 - 0x1aace0d9 = 2635569424 - 447537369 = 2188032055).

After synchronizing by:
  DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM('X','TABLE_1','TABLE',NULL,0)
the gap is closed, as shown by a new header dump:
  
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 1242809 #blocks: 19884944
                  last map  0x9d178f31  #maps: 2446   offset: 2720 
      Highwater::  0x9d179f90  ext#: 1242808 blk#: 16     ext size: 16   
  #blocks in seg. hdr's freelists: 0    
  #blocks below: 19804642
  mapblk  0x9d178f31  offset: 441  
                   Unlocked
  --------------------------------------------------------
  Low HighWater Mark :
      Highwater::  0x9d179f90  ext#: 1242808 blk#: 16     ext size: 16   
  #blocks in seg. hdr's freelists: 0    
  #blocks below: 19884944
  mapblk  0x9d178f31  offset: 441  
  Level 1 BMB for High HWM block: 0x9d179910
  Level 1 BMB for Low HWM block: 0x9d179910
  --------------------------------------------------------
  
 
In "Low HighWater Mark" section:
     -. "#blocks below: 7177417" is updated to "#blocks below: 19884944" (even bigger than 19804642).
     -. "Level 1 BMB for Low HWM block: 0x1aace0d9" is changed to "Level 1 BMB for Low HWM block: 0x9d179910",
               which got the same block id (0x9d179910) as "Level 1 BMB for High HWM block: 0x9d179910".

By chance, we also noticed that the gap between the High and Low High Water Mark can be synchronized by:
 select /*+ parallel(t, 2) */ count(*) from TABLE_1 t;


Segment Space and the High Water Mark


Oracle has a detailed description of the High HWM (or simply HWM) and the Low HWM
(see Segment Space and the High Water Mark):

 Suppose that a transaction inserts rows into the segment. The database must allocate a group of blocks to hold the rows. The allocated blocks fall below the HWM. The database formats a bitmap block in this group to hold the metadata,  but does not preformat the remaining blocks in the group. 

 The low HWM is important in a full table scan. Because blocks below the HWM are formatted only when used, some blocks could be unformatted, as in Figure 12-25. For this reason, the database reads the bitmap block to obtain the location of the low HWM. The database reads all blocks up to the low HWM because they are known to be formatted, and then carefully reads only the formatted blocks between the low HWM and the HWM.


We can see that ASSM uses bitmaps to track the state of blocks in the segment, and introduces the LHWM and HHWM to manage segment space.

Low High-Water Mark (LHWM): Like the old High Water Mark.
    All blocks below this block have been formatted for the table.

High High-Water Mark (HHWM)
    All blocks above this block have not been formatted.

The LHWM and the HHWM may not have the same value, depending on how the bitmap tree was traversed. If they differ, the blocks between them may or may not be formatted for use. The HHWM is necessary so that direct load operations can guarantee contiguous unformatted blocks.

During the parallel query, a select on v$bh shows more than 100 CR versions of the TABLE_1 segment header block. This could be triggered by bitmap block (BMB) checking.
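
A minimal sketch counting the buffer clones of the segment header block (file 94, block 2246443 from the dump above):

 select status, count(*)
   from v$bh
  where file# = 94 and block# = 2246443
  group by status;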

When looking at v$session.event, we saw heavy activity on:
 "db file parallel read"

AWR report also shows:
 db file parallel read:  202,467

Oracle says that this event happens during recovery or buffer prefetching. But in this case, it could be caused by reads of multiple BMB blocks from the different files in which they are located for each group.