Recently we saw a parallel SELECT that had been running for 4 hours:
select /*+ parallel(t,2) parallel_index(t,2) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad
*/count(*), count("ID"), count(distinct "ID"),
sum(sys_op_opnsize("ID")), substrb(dump(min("ID"),16,0,32),1,120),
substrb(dump(max("ID"),16,0,32),1,120),
...
from "X"."TABLE_1" sample ( 5.0000000000) t;
which is generated when executing:
dbms_stats.gather_table_stats('X', 'TABLE_1', estimate_percent=>5, degree=>2);
on both Oracle 11.2.0.3.0 and 11.2.0.4.0.
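As a side note, such a long-running statement generated by DBMS_STATS can be spotted in v$sql; a minimal sketch (the filter on sys_op_opnsize is just a convenient marker for the generated statement, not an official interface):
select sql_id, px_servers_executions, round(elapsed_time/1e6) elapsed_secs,
       substr(sql_text, 1, 80) sql_text_head
  from v$sql
 where sql_text like '%sys_op_opnsize%'
   and sql_text not like '%v$sql%'
 order by elapsed_time desc;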
The table is about 160 GB and contains 20m blocks (8K block size) in ASSM locally managed tablespaces (LMT); all columns are NUMBER or VARCHAR2.
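The size figures can be cross-checked in dba_segments, for example:
select blocks, round(bytes/1024/1024/1024) gb, extents
  from dba_segments
 where owner = 'X' and segment_name = 'TABLE_1';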
The AWR report shows the following statistics for the SELECT statement:
Buffer Gets 14,805,983,166
Physical Reads 40,474,817
UnOptimized Read Reqs 1,881,624
TABLE_1 Segment Statistics:
Segments by Logical Reads 16,312,997,632
Segments by Physical Reads 38,580,663
Segments by Direct Physical Reads 38,557,265
Segments by Table Scans 263,660
Segments by Buffer Busy Waits 53,492
At first glance, the surprisingly large "Table Scans" figure indicates abnormal behavior.
The huge number of "Buffer Gets" and the "Physical Reads" (about twice the table's 20m blocks) also explain the slow run.
"Table Scans" for Parallel Query is the Execs of "TABLE ACCESS FULL" below "PX BLOCK ITERATOR" in xplan, since "PX BLOCK ITERATOR" at first breaks up the table into granules which are then dispatched to each PX slaves. In the case of SAMPLE select, it is the Execs of "TABLE ACCESS SAMPLE BY ROWID RANGE".
The number of granules processed by each PX slave is controlled by two hidden parameters:
_px_min_granules_per_slave: minimum number of rowid range granules to generate per slave (default 13)
_px_max_granules_per_slave: maximum number of rowid range granules to generate per slave (default 100)
For instance, parallel(t,2) produces a minimum of 26 (2*13) "Table Scans" and a maximum of 200 (2*100),
while parallel(t,16) generates a minimum of 208 (16*13) and a maximum of 1600 (16*100).
Therefore, when increasing the degree from 2 to 16,
dbms_stats.gather_table_stats(null, 'TABLE_1', estimate_percent=>5, degree=>16);
will generate about 8 times more granules and take roughly 32 (8*4) hours instead of 4. That is counterintuitive behavior for a parallel query: a higher degree makes it slower.
The "Table Scans" can be monitored by:
select * from V$segment_statistics
where object_name = 'TABLE_1' and statistic_name = 'segment scans';
The hidden parameters can be changed by, for example:
alter system set "_px_min_granules_per_slave"=19;
Searching MOS, we found:
Bug 6474009 - On ASSM tablespaces Scans using Parallel Query can be slower than in 9.2 (Doc ID 6474009.8)
which says:
Scans using Parallel Query on ASSM tablespaces on 10.2 can be slower compared to 9.2 when there is a wide separation between the High and Low High Water Mark. There can be High contention and/or a large number of accesses to bitmap blocks in PQ slaves during extent-based scans
of table/index segments.
AWR reports may show "read by other session" as one of the top wait events and Section "Buffer Wait Statistics" may report "1st level bmb" as the most important wait class.
The difference between the High and Low High Water Mark can be identified by running procedure DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM with parameter check_only=1.
The AWR report confirms the above note:
read by other session Waits=11,302,432
1st level bmb Waits=98,112
and calling:
DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM('X','TABLE_1','TABLE',NULL,1) returns 1.
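For reference, a minimal PL/SQL sketch of this check, assuming ASSM_SEGMENT_SYNCHWM can be invoked as a function with the positional arguments used above (segment owner, segment name, segment type, partition name, check_only):
set serveroutput on
declare
  l_ret number;
begin
  -- check_only = 1: only report whether a HWM gap exists, do not synchronize
  l_ret := dbms_space_admin.assm_segment_synchwm('X', 'TABLE_1', 'TABLE', null, 1);
  dbms_output.put_line('assm_segment_synchwm(check_only=1) returns: ' || l_ret);
end;
/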
MOS also gives a Workaround:
Execute procedure DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM with check_only=0 to synchronize the gap between the High and Low High Water Mark.
IMPORTANT: A patch for bug 6493013 must be installed before running that procedure.
DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM only signals the existence of the gap,
but does not report its size. However, the width between the low HWM and the HWM
determines the performance of a full table scan (see the section "Segment Space and the High Water Mark" later in this Blog).
To find the exact gap size, we can dump the segment header by:
alter system dump datafile 94 block 2246443;
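The datafile and block number of the segment header used here can be looked up in dba_segments, for example:
select header_file, header_block
  from dba_segments
 where owner = 'X' and segment_name = 'TABLE_1';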
Here is the output of the dump:
-----------------------------------------------------------------
Extent Header:: spare1: 0 spare2: 0 #extents: 1242809 #blocks: 19884944
last map 0x9d178f31 #maps: 2446 offset: 2720
Highwater:: 0x9d179f90 ext#: 1242808 blk#: 16 ext size: 16
#blocks in seg. hdr's freelists: 0
#blocks below: 19804642
mapblk 0x9d178f31 offset: 441
Unlocked
--------------------------------------------------------
Low HighWater Mark :
Highwater:: 0x1e243af9 ext#: 448588 blk#: 16 ext size: 16
#blocks in seg. hdr's freelists: 0
#blocks below: 7177417
mapblk 0x1e242fd9 offset: 225
Level 1 BMB for High HWM block: 0x9d179910
Level 1 BMB for Low HWM block: 0x1aace0d9
--------------------------------------------------------
We can see that there are 19,804,642 blocks below the Highwater mark 0x9d179f90, but only 7,177,417 below the Low HighWater Mark 0x1e243af9, so 12,627,225 (19804642 - 7177417) blocks, about 64%, lie between the two.
The distance between the Level 1 BMB block of the High HWM and that of the Low HWM is 2,188,032,055
(0x9d179910 - 0x1aace0d9 = 2635569424 - 447537369 = 2188032055).
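These values are data block addresses (rdba), encoding a relative file number and a block number; they can be decoded with DBMS_UTILITY, for example:
select dbms_utility.data_block_address_file(to_number('9d179910', 'xxxxxxxx'))  hhwm_bmb_file,
       dbms_utility.data_block_address_block(to_number('9d179910', 'xxxxxxxx')) hhwm_bmb_block,
       dbms_utility.data_block_address_file(to_number('1aace0d9', 'xxxxxxxx'))  lhwm_bmb_file,
       dbms_utility.data_block_address_block(to_number('1aace0d9', 'xxxxxxxx')) lhwm_bmb_block
  from dual;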
After synchronizing by:
DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM('X','TABLE_1','TABLE',NULL,0)
the gap is closed, as shown by a new header dump:
-----------------------------------------------------------------
Extent Header:: spare1: 0 spare2: 0 #extents: 1242809 #blocks: 19884944
last map 0x9d178f31 #maps: 2446 offset: 2720
Highwater:: 0x9d179f90 ext#: 1242808 blk#: 16 ext size: 16
#blocks in seg. hdr's freelists: 0
#blocks below: 19804642
mapblk 0x9d178f31 offset: 441
Unlocked
--------------------------------------------------------
Low HighWater Mark :
Highwater:: 0x9d179f90 ext#: 1242808 blk#: 16 ext size: 16
#blocks in seg. hdr's freelists: 0
#blocks below: 19884944
mapblk 0x9d178f31 offset: 441
Level 1 BMB for High HWM block: 0x9d179910
Level 1 BMB for Low HWM block: 0x9d179910
--------------------------------------------------------
In "Low HighWater Mark" section:
-. "#blocks below: 7177417" is updated to "#blocks below: 19884944" (even bigger than 19804642).
-. "Level 1 BMB for Low HWM block: 0x1aace0d9" is changed to "Level 1 BMB for Low HWM block: 0x9d179910",
which got the same block id (0x9d179910) as "Level 1 BMB for High HWM block: 0x9d179910".
By chance, we also noticed that the gap between the High and Low High Water Mark can be synchronized by:
select /*+ parallel(t, 2) */ count(*) from TABLE_1 t;
Segment Space and the High Water Mark
Oracle has a detailed description of the High HWM (or simply HWM) and the Low HWM
(see the documentation section "Segment Space and the High Water Mark"):
Suppose that a transaction inserts rows into the segment. The database must allocate a group of blocks to hold the rows. The allocated blocks fall below the HWM. The database formats a bitmap block in this group to hold the metadata, but does not preformat the remaining blocks in the group.
The low HWM is important in a full table scan. Because blocks below the HWM are formatted only when used, some blocks could be unformatted, as in Figure 12-25. For this reason, the database reads the bitmap block to obtain the location of the low HWM. The database reads all blocks up to the low HWM because they are known to be formatted, and then carefully reads only the formatted blocks between the low HWM and the HWM.
We can see that ASSM uses bitmaps to track the state of the blocks in a segment, and introduces the LHWM and HHWM to manage segment space.
-. Low High-Water Mark (LHWM): like the old High Water Mark, all blocks below this block have been formatted for the table.
-. High High-Water Mark (HHWM): all blocks above this block have not been formatted.
The LHWM and the HHWM may not be the same value, depending on how the bitmap tree was traversed.
If they differ, the blocks between them may or may not be formatted for use.
The HHWM is necessary so that direct load operations can guarantee contiguous unformatted blocks.
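The number of unformatted blocks in an ASSM segment can be reported with DBMS_SPACE.SPACE_USAGE; a minimal sketch for our table:
set serveroutput on
declare
  l_unformatted_blocks number;  l_unformatted_bytes number;
  l_fs1_blocks number;  l_fs1_bytes number;
  l_fs2_blocks number;  l_fs2_bytes number;
  l_fs3_blocks number;  l_fs3_bytes number;
  l_fs4_blocks number;  l_fs4_bytes number;
  l_full_blocks number; l_full_bytes number;
begin
  dbms_space.space_usage('X', 'TABLE_1', 'TABLE',
                         l_unformatted_blocks, l_unformatted_bytes,
                         l_fs1_blocks, l_fs1_bytes, l_fs2_blocks, l_fs2_bytes,
                         l_fs3_blocks, l_fs3_bytes, l_fs4_blocks, l_fs4_bytes,
                         l_full_blocks, l_full_bytes);
  dbms_output.put_line('unformatted blocks: ' || l_unformatted_blocks);
  dbms_output.put_line('full blocks       : ' || l_full_blocks);
end;
/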
During the parallel query, a select on v$bh shows more than 100 CR versions of the TABLE_1 segment header block, which could be triggered by bitmap block (BMB) checking.
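The CR copies of the segment header block can be counted in v$bh (94/2246443 being the header datafile and block from the dump above):
select status, count(*)
  from v$bh
 where file# = 94 and block# = 2246443
 group by status;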
When looking at v$session.event, we saw heavy activity on:
"db file parallel read"
AWR report also shows:
db file parallel read: 202,467
Oracle says that this event happens during recovery or buffer prefetching, but in this case it could be caused by the reads of multiple BMB blocks, located in different datafiles, for each group of blocks.
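The system-wide figures for these waits can be cross-checked in v$system_event, for instance:
select event, total_waits, round(time_waited_micro/1e6) time_waited_secs
  from v$system_event
 where event in ('db file parallel read', 'read by other session')
 order by time_waited_micro desc;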