Sunday, January 7, 2018

Blog List

Oracle Latch Misses, Spin_gets, Sleeps and Recurrent_Misses

After the correction of Blog: Is latch misses statistic gathered or deduced ? I re-examined a few hourly AWR reports in one heavily loaded system (Oracle 12.1.0.2.0).


1. Latch Statistics and Recurrent_Misses


AWR Top Events Section looks like:

Top 10 Foreground Events by Total Wait Time

Event Waits Total Wait Time (sec) Avg wait (ms) % DB time Wait Class
latch: cache buffers chains 1,078,528 291.7K 270.50 37.6 Concurrency
latch: row cache objects 423,293 78.2K 184.86 10.1 Concurrency

AWR "Latch Sleep Breakdown" Section shows:

Latch Sleep Breakdown

  • ordered by misses desc
  • Latch Name Get Requests Misses Sleeps Spin Gets
    cache buffers chains 15,134,617,472 176,939,147 1,040,507 314,665,205
    row cache objects 101,225,008 4,075,626 429,314 3,673,739
    Result Cache: RC Latch 7,876,403 44,622 25,380 19,440

    Look "cache buffers chains" (CBC) statistics:
         Misses    = 176,939,147
         Sleeps    =   1,040,507
         Spin Gets = 314,665,205
    
    since Spin_Gets(314,665,204) > Misses(176,939,142), so there exist recurrent Spin_Gets.

    Every unsuccessful Spin_Get will trigger a Sleep (or each Sleep follows a Spin_Get), every unsuccessful Sleep will result in a successive longer Sleep.

    Latch Sleep is started with an initial duration of 10ms (or 1ms), and progressively elevated after every unsuccessful Sleep (See later discussion of V$LATCH.sleep[1-11], V$EVENT_HISTOGRAM.wait_time_milli, and pseudo-code in Book: Oracle Performance Firefighting Page 79).

    Suppose initial duration being 10ms, Avg wait (ms) = 270.50 indicates that there exist Sleeps, which are not initial Sleeps, hence recurrent Sleeps.

    For OS scheduler Time Slice of 10ms, Avg wait (ms) = 270.50 signifies that majority of Sleeps spans more than 20 Ticks, probably recurrent Sleeps (also possible not awakened from waitqueue, i.e., not scheduled into runqueue).

    The sum of recurrent_spin_gets and recurrent_sleeps can be computed by:
         spin_gets + sleeps - misses = 314,665,205 + 1,040,507 - 176,939,147 = 138,766,565 > 0
    
    Therefore, above formula could be refined as:
           sleeps + spin_gets – misses 
         = recurrent_misses 
         = recurrent_sleeps + recurrent_spin_gets 
    
    In the above example, recurrent_misses / Misses amounts to 78.4% (138,766,565 / 176,939,147), one serious case of latch misses.


    2. Observations

    (1). Latch Sleep Breakdown Table shows:
          "cache buffers chains": Spin_Gets(314,665,204) > Misses(176,939,142)
       So not every Spin_Get is a success Get, and each Miss can trigger multi Spin_Gets.
       
    (2). Sleeps can be bigger than Spin_Gets, for example, 
          "Result Cache: RC Latch": Sleeps(25,385) > Spin_Gets(19,446)
       
    (3). Usually when one Latch has very slight contention, 
           misses = sleeps + spin_gets
       It means that each Miss is resolved either by one Spin_Get, or by one Sleep.
       This also indicates that immediately after each Sleep (OS Process waking up and rescheduled), Oracle makes one Latch Get.
       
    (4). Top 10 Foreground Events Table shows:
          Latch Sleeps is approximate to Event Waits, i.e V$LATCH.sleeps = V$SYSTEM_EVENT.total_waits.
       It signifies that only Latch Sleeps is counted as Waits because Process is on the wait queue. 
       Latch Spin_Gets is not counted because Process is on CPU (runqueue).
    
    By the above observations, it is not clear how Oracle makes the counting of Misses, Sleeps and Spin_Gets in willing-to-wait mode. And it is hard to reconstruct a pseudo-code to match above countings.


    3. Discussions

    (1). Oracle session's Response Time is made of Service Time and Queue Time. Spin_Gets is counted as Service Time since it is on CPU.
       Whereas Sleeps is categorized as Queue Time since it is on waiting. For details, see Book: 
       Oracle Performance Firefighting Page 74-91.
    
    (2). As discussed, generally, Latch Spins is burning CPU; whereas Latch Sleeps yields CPU. Therefore when talking about Latch contentions, 
       it would be necessary to distinguish between Spin_gets and Sleeps. 
       As tested, usually Spin_gets of Latch Misses are caused by frequently concurrent access; whereas Sleeps of Latch Misses are triggered 
       by Invalidations or DML Modifications. 
       
    (3). Oracle V$LATCH SLEEP evaluated, which leaks the algorithm changing.
     
       In Oracle 9i, V$LATCH is documented as:
         SLEEP[1|2|3]           Waits that slept 1 time through 3 times, respectively
         SLEEP4                   Waits that slept 4 or more times
         SLEEP[5|6|7|8|9|10|11] present for compatibility with previous releases of Oracle. No data is accumulated.
         
       Since Oracle 10g, V$LATCH is updated as:
         SLEEP[1|2|3|4|5|6|7|8|9|10|11] deprecated. As a substitute, query V$EVENT_HISTOGRAM for EVENT like latch free or latch:%.                                 
       
       So the original design elevated SLEEP in 11 grades, with 9i, it was reduced to 4, and 10g regulated it in a new approach. 
       
    (4). Suppose OS scheduler Time Slice is 10ms (system clock tick frequency of 100 hertz), 
       Wait Event "Avg wait (ms)"/10 represents the paused ticks before placed back into the runqueue.
       
    (5). There are two sorts of Latches, one has children (see V$LATCH_CHILDREN), for example, "cache buffers chains";
       other one has no children and hence an instance_wide single Latch, for example, "Result Cache: RC Latch".
       Therefore, when comparing Sleeps and Spin_Gets among Latches, the number of children should be taken into account.
       For example, single Latch can serialize whole system.
    
    With following query, we can see number of Blocks protected per CBC Child Latch:
    
    with block_size as (select value b_val from v$parameter where name in ('db_block_size'))
        ,cache_size as (select sum(value) c_val from v$parameter where name in ('db_cache_size', 'db_keep_cache_size', 'db_recycle_cache_size'))
        ,latch_cnt  as (select count(*) cnt from v$latch_children where name in ('cache buffers chains'))
    select b_val, round(c_val/1024/1024) cache_size_mb, l.cnt, ceil(c.c_val/b_val/cnt) block_per_latch
    from block_size b, cache_size c, latch_cnt l;
    

    4. Queries

    -- v$latch
    select name, misses, sleeps, spin_gets
          ,sign((sleeps + spin_gets) - misses) miss_deduced_sign
          ,((sleeps + spin_gets) - misses)     recurrent_misses
          ,trunc((abs((sleeps + spin_gets) - misses) / nullif(misses, 0)) * 100 ,2)*100 recurrent_misses_perc
    from v$latch where misses > 1000
    order by misses desc;
    
    -- v$system_event
    select event, total_waits, round(time_waited_micro/1e6, 2) time_waited_sec, average_wait*1e4 average_wait_micro, e.*
    from v$system_event  e
    where event in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
    order by e.event;
    
    -- v$session_event
    select sid, event, total_waits, round(time_waited_micro/1e6, 2) time_waited_sec, average_wait*1e4 average_wait_micro, e.*
    from v$session_event e
    where e.event in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
     --and sid in (217, 1217)
    order by e.event, e.sid;
    
    -- dba_hist_latch (One can also query dba_hist_latch_parent, dba_hist_latch_children)
    select snap_id End_Snap, end_interval_time Snap_Time, latch_name Latch_Name
          ,gets_d       Get_request
          ,misses_d     Misses
          ,sleeps_d     Sleeps
          ,spin_gets_d  Spin_Gets
          --,v.*
    from (
    select gets-lag(gets) over(partition by latch_name order by e.snap_id) gets_d
          ,misses-lag(misses) over(partition by latch_name order by e.snap_id) misses_d
          ,sleeps-lag(sleeps) over(partition by latch_name order by e.snap_id) sleeps_d
          ,spin_gets-lag(spin_gets) over(partition by latch_name order by e.snap_id) spin_gets_d
          ,h.end_interval_time, e.* 
    from dba_hist_latch e, dba_hist_snapshot h
    where e.snap_id = h.snap_id 
      and e.latch_name in ('Result Cache: RC Latch', 'cache buffers chains', 'row cache objects')
      order by e.snap_id desc
    ) v
    --where snap_id = 1217
    order by snap_id desc, latch_name, gets_d desc;
    
    -- dba_hist_system_event. Similar to AWR "Top 10 Foreground Events by Total Wait Time"
    select snap_id End_Snap, end_interval_time Snap_Time,   event_name Event
          ,total_waits_d                                               Waits
          ,round(time_waited_micro_d/1e6,2)                            Total_Wait_Time_sec
          ,round(time_waited_micro_d/1e3/(nullif(total_waits_d, 0)),2) Wait_Avg_ms
          --,v.*
    from (
    select total_waits-lag(total_waits) over(partition by event_name order by e.snap_id) total_waits_d
          ,total_timeouts-lag(total_timeouts) over(partition by event_name order by e.snap_id) total_timeouts_d
          ,time_waited_micro-lag(time_waited_micro) over(partition by event_name order by e.snap_id) time_waited_micro_d
          ,h.end_interval_time, e.* 
    from dba_hist_system_event e, dba_hist_snapshot h
    where e.snap_id = h.snap_id 
      and e.event_name in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
      order by e.snap_id desc
    ) v
    where snap_id = 1217;
    order by snap_id desc, total_waits_d desc;
    

    Thursday, December 14, 2017

    Oracle DB File Read Access Path and Cost

    Following the discussion of Memory Read (Buffer Cache Read, Logical Read) in last Blog: Oracle Buffer Cache Read Access Path and Cost, now is the time to look into Disk Read (DB File Read, Physical Read).

    We will first test 3 varieties of db file read:
      db file sequential read
      db file scattered read 
      db file parallel read  
    
    and then compare the cost (statistics) reported in 3 places for different access path:
      SQL Trace
      Dtrace
      Oracle View: V$FILESTAT and V$IOSTAT_FILE
    
    Note: All tests are done in Oracle 12.1.0.2 on Solaris.


    1. DB File Read Access Path


    Run 4 variants of Access Path tests with SQl Trace (Event 10046), and at the same time dtace the process (see appended Dtrace Script).


    1.1. single Read


    
    SQL > exec db_file_read_test('single', 1, 333);
    
    -- adjacent rowid, single block read, 'db file sequential read'
    SELECT /*+ single_read */ Y FROM KSUN_TAB T WHERE ROWID = :B1 
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute    333      0.01       0.01          0          0          0           0
    Fetch      333      0.01       0.01        641        333          0         333
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total      667      0.02       0.02        641        333          0         333
    
    Row Source Operation
    ---------------------------------------------------
    TABLE ACCESS BY USER ROWID KSUN_TAB (cr=1 pr=8 pw=0 time=160 us cost=1 size=3513 card=1)
    
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      db file scattered read                         44        0.00          0.00
      db file sequential read                       289        0.00          0.00
     
    ------------------------------ dtrace ------------------------------
      PROBEFUNC      FD  RETURN_SIZE     COUNT
      lseek         260            0        44
      readv         260        65536        44
      pread         260         8192       289
    
      PROBEFUNC      FD  MAX_READ_Blocks
      pread         260                1
      readv         260                8
    
    TOTAL_SIZE = 5251072   , TOTAL_READ_Blocks = 641   , TOTAL_READ_CNT = 333
    
    readv         260
               value  ------------- Distribution ------------- count
                8192 |                                         0
               16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  43
               32768 |@                                        1
               65536 |                                         0
    
    
    pread         260
               value  ------------- Distribution ------------- count
                2048 |                                         0
                4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          226
                8192 |@@@@@@@@                                 60
               16384 |                                         2
               32768 |                                         0
               65536 |                                         1
              131072 |                                         0
    


    1.2. scattered Read


    
    SQL > exec db_file_read_test('scattered', 1, 333);
    
    -- jumped rowid, scattered read, 'db file scattered read'
    SELECT /*+ scattered_read */ Y FROM KSUN_TAB T WHERE ROWID = :B1 
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute    333      0.00       0.00          0          0          0           0
    Fetch      333      0.02       0.02       2664        333          0         333
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total      667      0.02       0.02       2664        333          0         333
    
    
    Row Source Operation
    ---------------------------------------------------
    TABLE ACCESS BY USER ROWID KSUN_TAB (cr=1 pr=8 pw=0 time=156 us cost=1 size=3513 card=1)
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      db file scattered read                        333        0.00          0.00
      
    ------------------------------ dtrace ------------------------------
      PROBEFUNC      FD  RETURN_SIZE     COUNT
      lseek         260            0        91
      readv         260        65536       333
    
      PROBEFUNC      FD  MAX_READ_Blocks
      readv         260                8
    
    TOTAL_SIZE = 21823488  , TOTAL_READ_Blocks = 2664  , TOTAL_READ_CNT = 333
    
    readv         260
               value  ------------- Distribution ------------- count
                8192 |                                         0
               16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      290
               32768 |@@@@@                                    43
               65536 |                                         0
    


    1.3. parallel Read


    
    SQL > exec db_file_read_test('parallel', 1, 333);
    
    SELECT /*+ index(t ksun_tab#i1) parallel_read */ MAX(Y) FROM KSUN_TAB T WHERE X BETWEEN 1 AND :B1 
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute      1      0.00       0.00          0          0          0           0
    Fetch        1      0.00       0.00        344        335          0           1
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3      0.00       0.00        344        335          0           1
    
    Row Source Operation
    ---------------------------------------------------
    SORT AGGREGATE (cr=335 pr=344 pw=0 time=3760 us)
     FILTER  (cr=335 pr=344 pw=0 time=1698 us)
      TABLE ACCESS BY INDEX ROWID BATCHED KSUN_TAB (cr=335 pr=344 pw=0 time=1361 us cost=168 size=1167165 card=333)
       INDEX RANGE SCAN KSUN_TAB#I1 (cr=2 pr=8 pw=0 time=279 us cost=1 size=0 card=333)(object id 2260477)
    
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      db file scattered read                          4        0.00          0.00
      db file parallel read                           2        0.00          0.00
      
     -- Raw Trace File --
     'db file scattered read' ela= 49  file#=917 block#=10368 blocks=8     obj#=2260477 (Index KSUN_TAB#I1)
     'db file scattered read' ela= 27  file#=917 block#=128   blocks=8     obj#=2260476 (Table KSUN_TAB) 
     'db file scattered read' ela= 21  file#=917 block#=136   blocks=8     obj#=2260476
     'db file parallel read'  ela= 422 files=1   blocks=127   requests=127 obj#=2260476
     'db file parallel read'  ela= 334 files=1   blocks=127   requests=127 obj#=2260476
     'db file scattered read' ela= 264 file#=917 block#=409   blocks=66    obj#=2260476
      
    ------------------------------ dtrace ------------------------------
       PROBEFUNC      FD  RETURN_SIZE     COUNT
       pread         260       540672         1
       lseek         260            0         2
       readv         260        65536         3
       pread         260         8192       254
     
       PROBEFUNC      FD  MAX_READ_Blocks
       readv         260                8
       pread         260               66
     
     TOTAL_SIZE = 2818048   , TOTAL_READ_Blocks = 344   , TOTAL_READ_CNT = 258
     
       readv         260
                value  ------------- Distribution ------------- count
                 8192 |                                         0
                16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
                32768 |@@@@@@@@@@@@@                            1
                65536 |                                         0
     
       pread         260
                value  ------------- Distribution ------------- count
                 2048 |                                         0
                 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          196
                 8192 |@@@@@@@@                                 52
                16384 |                                         2
                32768 |                                         0
                65536 |@                                        4
               131072 |                                         1
               262144 |                                         0
    


    1.4. full Read


    
    SQL > exec db_file_read_test('full', 1, 333);
    
    SELECT /*+ full_read */ MAX(Y) FROM KSUN_TAB T WHERE ROWNUM <= :B1 
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute      1      0.00       0.00          0          0          0           0
    Fetch        1      0.00       0.00        342        342          3           1
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        3      0.00       0.00        342        342          3           1
    
    Row Source Operation
    ---------------------------------------------------
    SORT AGGREGATE (cr=342 pr=342 pw=0 time=3788 us)
     COUNT STOPKEY (cr=342 pr=342 pw=0 time=1371 us)
      TABLE ACCESS FULL KSUN_TAB (cr=342 pr=342 pw=0 time=925 us cost=99 size=1169334 card=334)
    
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      db file sequential read                         2        0.00          0.00
      db file scattered read                         23        0.00          0.00
    
    -- Raw Trace File --
    'db file sequential read' ela= 19 file#=917 block#=130 blocks=1 obj#=2260476 tim=647299975335
    'db file sequential read' ela= 14 file#=3 block#=768 blocks=1 obj#=0 tim=647299975399
         -- UNDO file#=3   /oratestdb/oradata/testdb/undo01.dbf
    'db file scattered read' ela= 22 file#=917 block#=131 blocks=5 obj#=2260476 tim=647299975501
    'db file scattered read' ela= 25 file#=917 block#=136 blocks=8 obj#=2260476 tim=647299975609
    'db file scattered read' ela= 25 file#=917 block#=145 blocks=7 obj#=2260476 tim=647299975713
    'db file scattered read' ela= 23 file#=917 block#=152 blocks=8 obj#=2260476 tim=647299975806
    'db file scattered read' ela= 25 file#=917 block#=161 blocks=7 obj#=2260476 tim=647299975901
    'db file scattered read' ela= 24 file#=917 block#=168 blocks=8 obj#=2260476 tim=647299975994
    'db file scattered read' ela= 23 file#=917 block#=177 blocks=7 obj#=2260476 tim=647299976088
    'db file scattered read' ela= 23 file#=917 block#=184 blocks=8 obj#=2260476 tim=647299976178
    'db file scattered read' ela= 23 file#=917 block#=193 blocks=7 obj#=2260476 tim=647299976270
    'db file scattered read' ela= 22 file#=917 block#=200 blocks=8 obj#=2260476 tim=647299976364
    'db file scattered read' ela= 23 file#=917 block#=209 blocks=7 obj#=2260476 tim=647299976465
    'db file scattered read' ela= 22 file#=917 block#=216 blocks=8 obj#=2260476 tim=647299976554
    'db file scattered read' ela= 22 file#=917 block#=225 blocks=7 obj#=2260476 tim=647299976646
    'db file scattered read' ela= 29 file#=917 block#=232 blocks=8 obj#=2260476 tim=647299976759
    'db file scattered read' ela= 24 file#=917 block#=241 blocks=7 obj#=2260476 tim=647299976866
    'db file scattered read' ela= 23 file#=917 block#=248 blocks=8 obj#=2260476 tim=647299976956
    'db file scattered read' ela= 128 file#=917 block#=258 blocks=32 obj#=2260476 tim=647299977200
    'db file scattered read' ela= 95 file#=917 block#=290 blocks=32 obj#=2260476 tim=647299977511
    'db file scattered read' ela= 97 file#=917 block#=322 blocks=32 obj#=2260476 tim=647299977822
    'db file scattered read' ela= 87 file#=917 block#=354 blocks=30 obj#=2260476 tim=647299978113
    'db file scattered read' ela= 96 file#=917 block#=386 blocks=32 obj#=2260476 tim=647299978407
    'db file scattered read' ela= 108 file#=917 block#=418 blocks=32 obj#=2260476 tim=647299978719
    'db file scattered read' ela= 94 file#=917 block#=450 blocks=32 obj#=2260476 tim=647299979021
    
    ------------------------------ dtrace ------------------------------
       PROBEFUNC      FD  RETURN_SIZE     COUNT
       pread         260         8192         1
       pread         260       245760         1
       readv         260        40960         1
       pread         260       262144         6
       readv         260        57344         7
       lseek         260            0         8
       readv         260        65536         8
     
       PROBEFUNC      FD  MAX_READ_Blocks
       readv         260                8
       pread         260               32
     
     TOTAL_SIZE = 2793472   , TOTAL_READ_Blocks = 341   , TOTAL_READ_CNT = 24
     
       readv         260
                value  ------------- Distribution ------------- count
                 8192 |                                         0
                16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16
                32768 |                                         0
     
       pread         260
                value  ------------- Distribution ------------- count
                16384 |                                         0
                32768 |@@@@@@@@@@@@@@@@@@@@                     4
                65536 |@@@@@@@@@@@@@@@                          3
               131072 |@@@@@                                    1
               262144 |                                         0
    
    *Note: One extra Read in SQL Trace: 25 (=23+2) Reads with 342 Blocks in SQL Trace vs. 
           dtrace 24 Reads with 341 Blocks is due to one UNDO Read (file#=3).
    


    2. UNIX References


    At first some documentations from UNIX man pages and internet.
      ssize_t pread(int fildes, void * buf, size_t nbyte, off_t offset);
      ssize_t readv(int fildes, struct iovec * iov, int iovcnt);
      off_t lseek(int fildes, off_t offset, int whence)
    
    pread(): performs the same action as read(), except that it reads from a given position in the file 
             without changing the file pointer. The first three arguments to pread() are the same as read() with the addition of 
             a fourth argument offset for the desired position inside the file. pread() will read up to the maximum offset value 
             that can be represented in an off_t for regular files.
    
    readv(): is equivalent to read(), but places the input data into the iovcnt buffers 
             specified by the members of the iov array: iov 0 , iov 1 , ..., iov [ iovcnt -1]. 
             The iovcnt argument is valid if greater than 0 and less than or equal to IOV_MAX .
             
             The iovec structure contains the following members:
                 caddr_t   iov_base;
                 int       iov_len;  
             Each iovec entry specifies the base address and length of an area in memory where data should be placed. 
             The readv() function always fills an area completely before proceeding to the next.
             
             Upon successful completion, readv() marks for update the st_atime field of the file.
             
             IOV_MAX: 1024 On Linux, 16 On Solaris, 16 On AIX and HP-UX
                
    readv (Linux): Read data from file descriptor FD, and put the result in the buffers described by VECTOR, 
             which is a vector of COUNT 'struct iovec's. The buffers are filled in the order specified. 
             Operates just like 'read' (see <unistd.h>) except that data are put in VECTOR instead of a contiguous buffer.    
    
    lseek(): sets the file pointer associated with the open file descriptor specified by fildes.
    


    3. Oracle References


    db file sequential read: P1 = file#, P2 = block#, P3 = blocks
     file#:  This is the file# of the file that Oracle is trying to read from. From Oracle8 onwards it is the ABSOLUTE file number (AFN).
     block#: This is the starting block number in the file from where Oracle starts reading the blocks. Typically only one block is being read.
     blocks: This parameter specifies the number of blocks that Oracle is trying to read from the file# starting at block#. This is
       usually "1" but if P3 > 1 then this is a multiblock read. Multiblock "db file sequential read"s may be seen in 
       earlier Oracle versions when reading from a SORT (TEMPORARY) segments. 
       
    db file scattered read:  P1 = file#, P2 = block#, P3 = blocks
     file#:  This is the file# of the file that Oracle is trying to read from. In Oracle8 onwards it is the ABSOLUTE file number (AFN).
     block#: This is the starting block number in the file from where Oracle starts reading the blocks. 
     blocks: This parameter specifies the number of blocks that Oracle is trying to read from the file# starting at block#. 
       The upper limit is DB_FILE_MULTIBLOCK_READ_COUNT, which is self tuned from Oracle 10.2 onwards.
                    
    db file parallel read:   P1 = files, P2 = blocks, P3 = requests
     files:    This indicates the number of files to which the session is reading
     blocks:   This indicates the total number of blocks to be read
     requests: This indicates the total number of I/O requests, which will be the same as blocks
     
    WAITEVENT: "db file sequential read" Reference Note (Doc ID 34559.1)
     This signifies a wait for an I/O read request to complete. This call differs from "db file scattered read" 
     in that a sequential read reads data into contiguous memory (whilst a scattered read reads multiple blocks 
     and scatters them into different buffers in the SGA). 
    
    WAITEVENT: "db file scattered read" Reference Note (Doc ID 34558.1) 
     This wait happens when a session is waiting for a multiblock IO to complete. This typically occurs during 
     FULL TABLE SCANs or INDEX FAST FULL SCANs. Oracle reads up to DB_FILE_MULTIBLOCK_READ_COUNT consecutive blocks 
     at a time and scatters them into buffers in the buffer cache.
      
    C.3.34 db file parallel read
     This happens during recovery. It can also happen during buffer prefetching, 
     as an optimization (rather than performing multiple single-block reads). 
    


    4. Discussions


    (a). pread and readv: both read contiguous file space. pread places the input data into one single contiguous buffer (memory space);
         whereas readv distributes them into multi buffers.
         
         It looks like that the difference between pread and readv is the difference of memory allocation. Their disk operations are the same.
         
         pread specifies file reading position by a third parameter: offset;
         whereas readv requires a precedent lseak (with parameter: offset) to fix the file reading position.
    
         Contiguous file space means logically contiguous in a file, but does not mean physically contiguous in a disk.
         Oracle 12 Logical Storage Structures Section: Overview of Extents wrote:
              An extent is a unit of database storage made up of logically contiguous data blocks. Data blocks can be physically spread out 
                 on disk because of RAID striping and file system implementations.
    
    (b). pread can fulfill all 3 kinds of db file read:
             db file sequential read (Test 1.1. single Read)
             db file scattered read  (Test 1.3. parallel Read, see next discussion)
             db file parallel read   (Test 1.3. parallel Read, see next discussion)
         whereas readv is for
             db file scattered read  (Test 1.2. scattered Read)  
    
    (c). 'db file parallel read' specifies the number of files (first parameter), and number of blocks to read (second and third parameters are equal).
         
         We only observe 'db file parallel read' in Test:
            exec db_file_read_test('parallel', 1, 333);   
    
    
    Event waited on                             Times   Max. Wait  Total Waited
    ----------------------------------------   Waited  ----------  ------------
    db file scattered read                          4        0.00          0.00
    db file parallel read                           2        0.00          0.00
    
    -- Syscalls and Raw Trace File --
    readv: 'db file scattered read' ela= 49  file#=917 block#=10368 blocks=8     obj#=2260477 (Index KSUN_TAB#I1)
    readv: 'db file scattered read' ela= 27  file#=917 block#=128   blocks=8     obj#=2260476 (Table KSUN_TAB) 
    readv: 'db file scattered read' ela= 21  file#=917 block#=136   blocks=8     obj#=2260476
    pread: 'db file parallel read'  ela= 422 files=1   blocks=127   requests=127 obj#=2260476
    pread: 'db file parallel read'  ela= 334 files=1   blocks=127   requests=127 obj#=2260476
    pread: 'db file scattered read' ela= 264 file#=917 block#=409   blocks=66    obj#=2260476
    
    ------------------------------ dtrace ------------------------------
      PROBEFUNC      FD  RETURN_SIZE     COUNT
      pread         260       540672         1
      lseek         260            0         2
      readv         260        65536         3
      pread         260         8192       254
      
      PROBEFUNC      FD  MAX_READ_Blocks
      readv         260                8
      pread         260               66
    
    TOTAL_SIZE = 2818048   , TOTAL_READ_Blocks = 344   , TOTAL_READ_CNT = 258
    
    From above SQL Raw Trace File and Dtrace, we can see: first 3 'db file scattered read' with "blocks=8", corresponding to 3 readv, each reads 65536 Bytes (8 Blocks). next 2 'db file parallel read' with "blocks=127", corresponding to 254 pread, each reads 8192 Bytes (1 Block). last 'db file scattered read' with "blocks=66", corresponding to 1 pread, which reads 540672 Bytes (66 Blocks). In other words, there exist only 3 readv, which read 3*65536 Bytes=24 Blocks. These 3 readv can neither match to 254 (=127 + 127) Blocks 'db file parallel read', nor 66 Blocks 'db file scattered read'. Therefore both 'db file parallel read' and 'db file scattered read' in this test are accomplished by pread. 'db file parallel read' is similar to 'db file sequential read', the difference is that former reads multi Blocks (probably asynchronously), but later reads one single Block. In fact, we can observe the 254 aio requests by:
    
    SQL > exec db_file_read_test('parallel', 1, 333);
    
    sudo dtrace -n '
    syscall::pread:entry / pid == $1 && arg0 == $2 / {self->pread_fd = arg0;} 
    syscall::pread:return/ pid == $1 && self->pread_fd == $2 / 
     {@STACK_CNT[probefunc, self->pread_fd, arg1, ustack(5, 0)]=count(); self->pread_fd = 0;}
    ' 11352 260 
    
      pread   260   540672
        libc.so.1`_pread+0xa
        oracle`skgfqio+0x284
        oracle`ksfd_skgfqio+0x195
        oracle`ksfd_skgfrvio+0xcb4
        oracle`ksfd_vio+0x9a3
            1
            
      pread   260   8192
        libc.so.1`_pread+0xa
        libaio.so.1`_aio_do_request+0x18e
        libc.so.1`_thr_setup+0x5b
        libc.so.1`_lwp_start
          254
    
    'db file parallel read' has 2 "plural" parameters, P1 (files), P3 (requests) (P2=P3). Both can denote one dimension of parallel operations. P1 (files) signifies multi files reading in parallel. P3 (requests) stands for multi parallel Disk Reading requests delegating to multi-threading LWP processes. Similar to 'log file parallel write' and 'control file parallel write', in which first parameter "files" represents number of log files (in one redo group) and number of control files. In the above example, since files=1, therefore 'db file parallel read' implies multi requests. Because multi requests are performed asynchronously, probably better named as 'db file async read'. If we set DISK_ASYNCH_IO=false (and restart DB), there is no more aio observed with dtrace as shown in following output, but SQL Trace is not able to reveal this setting change, and it still shows the same output as above.
    
      pread   268   540672
        libc.so.1`_pread+0xa
        oracle`skgfqio+0x284
        oracle`ksfd_skgfqio+0x195
        oracle`ksfd_skgfrvio+0xcb4
        oracle`ksfd_vio+0x9a3
            1
      pread    268   8192
        libc.so.1`_pread+0xa
        oracle`skgfqio+0x284
        oracle`ksfd_skgfqio+0x203
        oracle`ksfdgo+0x188
        oracle`ksfd_sbio+0xdd1
          254   
    
    The last 'db file scattered read' with "blocks=66" also shows one pread can read 66 Blocks, much higher than db_file_multiblock_read_count=32 in this database. By the way, we have 3 readv, but only 2 lseek, so there are probably 2 readv share one lseek.


    5. V$FILESTAT vs. V$IOSTAT_FILE Views


    There exist two Oracle views recording file disk I/O statistics:
      1. V$FILESTAT and its Cumulative DBA_HIST_FILESTATXS in centisecond since Oracle 8 (or 7).
      2. V$IOSTAT_FILE and DBA_HIST_IOSTAT_* in milliseconds since Oracle 11.
    
    For temp files, the counterpart of V$FILESTAT is V$TEMPSTAT; and in V$IOSTAT_FILE, filetype_name is marked as 'Temp File'.

    According to MOS (Doc ID 2222370.1): What Does "1-bk Rds/s" and "Av 1-bk Rd(ms)" in "Tablespace IO Stats" and "File IO Stats" in AWR Report Mean? (Doc ID 2222370.1)
    "Tablespace IO Stats" and "File IO Stats" in AWR report come from the columns of DBA_HIST_FILESTATXS.

    But AWR basis table SYS.WRH$_IOSTAT_FILETYPE is filled by V$IOSTAT_FILE.
    
    INSERT INTO WRH$_IOSTAT_FILETYPE (dbid,
                                      con_dbid,
                                      snap_id,
                                      instance_number,
                                      filetype_id,
                                      small_read_megabytes,
                                      small_write_megabytes,
                                      large_read_megabytes,
                                      large_write_megabytes,
                                      small_read_reqs,
                                      small_write_reqs,
                                      small_sync_read_reqs,
                                      large_read_reqs,
                                      large_write_reqs,
                                      small_read_servicetime,
                                      small_write_servicetime,
                                      small_sync_read_latency,
                                      large_read_servicetime,
                                      large_write_servicetime,
                                      retries_on_error)
        SELECT :dbid,
               :srcdbid                    con_dbid,
               :snap_id,
               :instance_number,
               filetype_id,
               SUM (small_read_megabytes)  small_read_megabytes,
               SUM (small_write_megabytes) small_write_megabytes,
               SUM (large_read_megabytes)  large_read_megabytes,
               SUM (large_write_megabytes) large_write_megabytes,
               SUM (small_read_reqs)       small_read_reqs,
               SUM (small_write_reqs)      small_write_reqs,
               SUM (small_sync_read_reqs)  small_sync_read_reqs,
               SUM (large_read_reqs)       large_read_reqs,
               SUM (large_write_reqs)      large_write_reqs,
               SUM (small_read_servicetime) small_read_servicetime,
               SUM (small_write_servicetime) small_write_servicetime,
               SUM (small_sync_read_latency) small_sync_read_latency,
               SUM (large_read_servicetime) large_read_servicetime,
               SUM (large_write_servicetime) large_write_servicetime,
               SUM (retries_on_error)      retries_on_error
          FROM v$iostat_file
      GROUP BY filetype_id;  
    
    In AWR Report, both V$IOSTAT_FILE and V$FILESTAT data are visible in different places. For example, in Section: "IOStat by Filetype summary", the last 2 Columns are named as "Small Read" and "Large Read", which are probably from V$IOSTAT_FILE. But in Section "Tablespace IO Stats" and "File IO Stats", the Column names are prefixed by "1-bk Rd", which are from from V$FILESTAT. Therefore, potential stats inconsistence can appear even in the same AWR report.


    6. DB File Read Stats Tests


    Run test (see appended Test Code):
    
    alter session set timed_statistics = true; 
    alter session set statistics_level=all; 
    
    truncate table read_stats;
    exec db_file_read_test('single',    1, 333);
    exec db_file_read_test('scattered', 1, 333);
    exec db_file_read_test('parallel',  1, 333);
    exec db_file_read_test('full',      1, 333);
    
    and than watch statistics in both views (AIX and Linux are added for comparison. Copy HTML Table to Excel for easy reading):
    
    select test_name,  
           phyrds, phyblkrd, singleblkrds, singleblkrdtim, 
           ceil(small_read_megabytes*1024*1024/8192) small_read_blks,       -- converted to Block for comparison 
           small_read_reqs, small_read_servicetime, small_sync_read_reqs, small_sync_read_latency,  
           ceil(large_read_megabytes*1024*1024/8192) large_read_blks,       -- converted to Block for comparison 
           large_read_reqs, large_read_servicetime 
    from read_stats_delta_v where phyrds > 0 order by test_name desc, test_timestamp;
    
    
    
    OS test_name phyrds phyblkrd singleblkrds singleblkrdtim small_read_blks small_read_reqs small_read_servicetime small_sync_read_reqs small_sync_read_latency large_read_blks large_read_reqs large_read_servicetime
    Solaris single
    333
    641
    289
    10
    640
    333
    0
    289
    0
    0
    0
    0
    Solaris scattered
    333
    2664
    0
    0
    2560
    333
    0
    0
    0
    0
    0
    0
    Solaris parallel
    258
    344
    254
    0
    384
    257
    0
    508
    0
    0
    1
    0
    Solaris full
    24
    341
    1
    0
    0
    17
    0
    1
    0
    256
    7
    0
    AIX single
    333
    333
    333
    150
    384
    333
    37
    333
    37
    0
    0
    0
    AIX scattered
    333
    333
    333
    1460
    384
    333
    1310
    333
    1310
    0
    0
    0
    AIX parallel
    260
    335
    259
    110
    256
    259
    17
    259
    17
    0
    1
    1
    AIX full
    24
    341
    1
    20
    128
    17
    41
    1
    12
    256
    7
    26
    Linux single
    333
    641
    289
    0
    640
    333
    0
    289
    0
    0
    0
    0
    Linux scattered
    333
    2664
    0
    0
    2688
    333
    0
    0
    0
    0
    0
    0
    Linux parallel
    258
    344
    254
    0
    256
    257
    0
    508
    0
    0
    1
    0
    Linux full
    24
    341
    1
    0
    128
    17
    0
    1
    0
    256
    7
    0


    7. DB File Read Stats Test Discussion


    The first 4 stats Columns (green) are from V$FILESTAT, the rest from V$IOSTAT_FILE. Here some observations:
    (a). V$FILESTAT.phyrds matches TOTAL_READ_CNT, equal to V$IOSTAT_FILE.(small_read_reqs + large_read_reqs).
          V$FILESTAT.phyblkrd matches TOTAL_READ_Blocks, approximate to V$IOSTAT_FILE.(small_read_blks + large_read_blks).
          V$FILESTAT.singleblkrds matches V$IOSTAT_FILE.small_sync_read_reqs except "parallel".
             V$FILESTAT.singleblkrds are 1 DB Block(8192 Bytes) Read (' db file sequential read' or 'db file parallel read').
             It is not clear why small_sync_read_reqs (508) in "parallel" is much higher than dtrace TOTAL_READ_Blocks (344).
         V$IOSTAT_FILE.small_read_reqs includes small_sync_read_reqs execept "parallel".
         V$IOSTAT_FILE.small_sync_read_reqs matches 'db file sequential read' except "parallel".
         
         All time metrics are (almost) zero: singleblkrdtim, small_read_servicetime, small_sync_read_latency, large_read_servicetime.
         Time in V$FILESTAT is in centiseconds, V$IOSTAT_FILE in milliseconds. Both could have rounding errors.
         
         AIX seems able to record the time metrics, probably because they are bigger than the minimum time granule.
    
    (b). Only 'scattered' Test:
              exec db_file_read_test('scattered', 1, 333);
         shows much higher small_read_blks: 2560 Blocks by small_read_reqs: 333 Requests (2560/333 approximately 8 Blocks per Request), 
         marked as "db file scattered read" and accomplished by "readv".
    
         However, in V$IOSTAT_FILE, they are entirely classified as SMALL_READ_MEGABYTES (small_read_blks) and SMALL_READ_REQS.
         But Oracle 12c V$IOSTAT_FILE Docu said that SMALL_READ_* are all about "single block read" even though dtrace evidenced 
         them as multi block read (8 Blocks per readv). 
         
         Logically it is controversial if small_read_blks is bigger than small_read_reqs because each single block read request
         can read only one single block.
    
    (c). 'db file parallel read' in "parallel" test seems running asynchronously, and hard to collect precise statistics.
    
    (d). V$IOSTAT_FILE has one pair of stats about single block read
            -. small_read_reqs / small_read_servicetime
            -. small_sync_read_reqs / small_sync_read_latency
    
    (e). Oracle Time Accounting also evolves with time. MOS:
           I/O Wait Event Accounting Change in the RDBMS 12.1.0.2.0 Patch Set Release (Doc ID 1967410.1)
         said:
           Starting with the RDBMS 12.1.0.2.0 Patch Set Release, the accounting for certain I/O-related wait events has been updated 
           to be more accurate than it was before. As a result, these wait events may show an increase of wait times of around 10%.
             
           The wait events that are affected include, but are not limited to, the following:
             db file sequential read
             db file scattered read
             db file parallel read
             log file parallel write
    


    8. Test Code


    
    drop tablespace ksun_ts including contents and datafiles;
    create tablespace ksun_ts datafile '/oratestdb/oradata/testdb/ksun_ts.dbf' size 200m online;
    
    drop table ksun_tab;
    
    -- DB_BLOCK_SIZE = 8192, each row occupies one BLOCK
    create table ksun_tab tablespace ksun_ts as 
    select level x, rpad('ABC', 3500, 'X') y, rpad('ABC', 3500, 'X') z from dual connect by level <= 1e4; 
    
    select round(bytes/1024/1024) mb, blocks from dba_segments where segment_name = 'KSUN_TAB';       
    --80 10240
    
    create index ksun_tab#i1 on ksun_tab(x) tablespace ksun_ts;
    
    exec dbms_stats.gather_table_stats(null, 'KSUN_TAB', cascade=>true);
    
    drop tablespace ksun_ts_aux including contents and datafiles;
    create tablespace ksun_ts_aux datafile '/oratestdb/oradata/testdb/ksun_ts_aux.dbf' size 200m online;
    
    drop table ksun_rid;
    create table ksun_rid tablespace ksun_ts_aux as select x, rowid rid from ksun_tab;
    create index ksun_rid#i1 on ksun_rid(x) tablespace ksun_ts_aux;
    
    exec dbms_stats.gather_table_stats(null, 'KSUN_RID', cascade=>true);
    
    create or replace view read_stats_v as
    select to_char(localtimestamp, 'yyyy-mm-dd hh24:mi:ss') test_timestamp
          ,phyrds, phyblkrd, singleblkrds, 10*singleblkrdtim singleblkrdtim
          ,small_read_megabytes, small_read_reqs, small_read_servicetime, small_sync_read_reqs, small_sync_read_latency
          ,large_read_megabytes, large_read_reqs, large_read_servicetime
          ,f.name
    from   v$filestat v8,  v$iostat_file v11, v$datafile f
    where  v8.file#    = f.file# 
      and  v11.file_no = f.file#
      and  f.name  like '%ksun_ts.dbf';
      
    drop table read_stats;
    
    create table read_stats as select 'setall_seq_readxx' test_name, v.* from read_stats_v v where 1=2;
    
    create or replace view read_stats_delta_v as
    select test_name, test_timestamp
          ,phyrds         - lag(phyrds)         over(partition by test_name order by test_timestamp)                   phyrds                 
          ,phyblkrd       - lag(phyblkrd)       over(partition by test_name order by test_timestamp)                   phyblkrd              
          ,singleblkrds   - lag(singleblkrds)   over(partition by test_name order by test_timestamp)                   singleblkrds 
          ,singleblkrdtim - lag(singleblkrdtim) over(partition by test_name order by test_timestamp)                   singleblkrdtim         
          ,small_read_megabytes    - lag(small_read_megabytes)  over(partition by test_name order by test_timestamp)   small_read_megabytes  
          ,small_read_reqs         - lag(small_read_reqs)         over(partition by test_name order by test_timestamp) small_read_reqs 
          ,small_read_servicetime  - lag(small_read_servicetime)  over(partition by test_name order by test_timestamp) small_read_servicetime 
          ,small_sync_read_reqs    - lag(small_sync_read_reqs)    over(partition by test_name order by test_timestamp) small_sync_read_reqs 
          ,small_sync_read_latency - lag(small_sync_read_latency) over(partition by test_name order by test_timestamp) small_sync_read_latency
          ,large_read_megabytes    - lag(large_read_megabytes)    over(partition by test_name order by test_timestamp) large_read_megabytes
          ,large_read_reqs         - lag(large_read_reqs)         over(partition by test_name order by test_timestamp) large_read_reqs  
          ,large_read_servicetime  - lag(large_read_servicetime)  over(partition by test_name order by test_timestamp) large_read_servicetime 
    from read_stats s;
    
    create or replace procedure db_file_read_test (p_test_name varchar2, p_loops number, p_rows number) as
     l_max_y           varchar2(3500);
     type tab_rowid is table of rowid index by pls_integer;
     l_rowid_cache     tab_rowid;
    begin
      case 
       when p_test_name = 'single' then
         select rid bulk collect into l_rowid_cache from ksun_rid where x between 1 and p_rows;                
       when p_test_name = 'scattered' then
         select rid bulk collect into l_rowid_cache from ksun_rid where mod(x, 10) = 0 and rownum <= p_rows;   
       else null;
     end case;
     dbms_output.put_line('Number of Rows to read = '|| l_rowid_cache.count);
     
     insert into read_stats select p_test_name test_name, v.* from read_stats_v v;
     for i in 1..p_loops loop
      execute immediate 'alter system flush buffer_cache';
      dbms_lock.sleep(1);
      case 
        when p_test_name = 'single' then
         for r in 1..l_rowid_cache.count loop
           -- adjacent rowid, single block read, 'db file sequential read'
           select /*+ single_read */ y into l_max_y from ksun_tab t where rowid = l_rowid_cache(r);
         end loop;
        when p_test_name = 'scattered' then
         for r in 1..l_rowid_cache.count loop
           -- jumped rowid, scattered read, 'db file scattered read'
           select /*+ scattered_read */ y into l_max_y from ksun_tab t where rowid = l_rowid_cache(r);
         end loop;   
       when p_test_name = 'parallel' then
         -- table access by index rowid batched   'db file parallel read'
         select /*+ index(t ksun_tab#i1) parallel_read */ max(y) into l_max_y from ksun_tab t where x between 1 and p_rows;
       when p_test_name = 'full' then
         -- table access by FULL
         select /*+ full_read */ max(y) into l_max_y from ksun_tab t where rownum <= p_rows;
       end case;
      dbms_lock.sleep(1);
      insert into read_stats select p_test_name test_name, v.* from read_stats_v v;
      end loop;
      commit;
    end;
    /
    


    9. Dtrace Script


    
    pfiles 11352
      260: /oratestdb/oradata/testdb/ksun_ts.dbf
      
    --Dtrace Script
    
    #!/usr/sbin/dtrace -Zs
    
    /* 
     * read_dtrace.d   pid   fd
     * chmod u+x read_dtrace.d
     * sudo ./read_dtrace.d 11352 260 
    */
    
    BEGIN / $1 > 0 && $2 > 0 / 
     {TOTAL_SIZE = 0; TOTAL_READ_CNT = 0; } 
    syscall::pread:entry / pid == $1 && arg0 == $2 / 
      {self->pread_fd = arg0; self->pread_t = timestamp;} 
    syscall::pread:return/ pid == $1 && self->pread_fd == $2 / 
     {@CNT[probefunc, self->pread_fd, arg1] = count(); 
      @MAXB[probefunc, self->pread_fd] = max(arg1/8192);
      @ETIME[probefunc, self->pread_fd] = quantize(timestamp- self->pread_t); 
       TOTAL_SIZE = TOTAL_SIZE + arg1; TOTAL_READ_CNT = TOTAL_READ_CNT + 1;
       self->pread_fd = 0;} 
    syscall::readv:entry / pid == $1 && arg0 == $2 / 
     {self->readv_fd = arg0; self->readv_t = timestamp; } 
    syscall::readv:return/ pid == $1 && self->readv_fd == $2 / 
     {@CNT[probefunc, self->readv_fd, arg1] = count(); 
      @MAXB[probefunc, self->readv_fd] = max(arg1/8192);
      @ETIME[probefunc, self->readv_fd] = quantize(timestamp- self->readv_t); 
       TOTAL_SIZE = TOTAL_SIZE + arg1; TOTAL_READ_CNT = TOTAL_READ_CNT + 1;
       self->readv_fd = 0;} 
    syscall::kaio:entry  / pid == $1 && arg1 == $2 / 
     {self->kaio = arg1;} 
    syscall::kaio:return / pid == $1 && self->kaio == $2 /  
     {@CNT[probefunc,  self->kaio, arg1] = count(); self->kaio  = 0;} 
    syscall::lseek:entry / pid == $1 && arg0 == $2/ 
     {@CNT[probefunc, arg0, 0] = count(); } 
    END / $1 > 0 && $2 > 0 / 
        {printf("\n%11s  %6s %12s %9s \n", "PROBEFUNC", "FD", "RETURN_SIZE", "COUNT");
       printa("  %-10s %6d %12d %9@d\n", @CNT);
       printf("\n%11s  %6s %16s \n", "PROBEFUNC", "FD", "MAX_READ_Blocks");
       printa("  %-10s %6d %16@d\n", @MAXB);
       printf("\nTOTAL_SIZE = %-10d, TOTAL_READ_Blocks = %-6d, TOTAL_READ_CNT = %-6d\n", 
                 TOTAL_SIZE, TOTAL_SIZE/8192, TOTAL_READ_CNT);
       printa("\n%-10s %6d %16@d\n", @ETIME);}
    

    Oracle Buffer Cache Read Access Path and Cost

    This Blog will first test 4 different Buffer Cache Read (aka, Logical Read) Access Path and measure the Cost in terms of Event 10200 ("consistent read buffer status"), then demonstrate 'latch: cache buffers chains' in row-by-row slow processing.

    Note: All tests are done in Oracle 12.1.0.2 on AIX, Solaris, Linux with 6 physical processors.


    1. Test Setup


    Pick the same Test Code from Blog: Hot Block Identification and latch: cache buffers chains.
    
    drop table test_tab;
                  
    create table test_tab 
      INITRANS   26           -- prevent Segments ITL Waits and Deadlocks
    as select level id, rpad('ABC', 10, 'X') val from dual connect by level <= 5;
    
    alter table test_tab minimize records_per_block;
    
    truncate table test_tab;
    
    insert into test_tab select level id, rpad('ABC', 10, 'X') val from dual connect by level <= 100;
    
    commit;
    
    drop index test_tab#u1;
    
    create unique index test_tab#u1 on k.test_tab (id) 
      pctfree    90
      initrans   26        -- prevent Segments ITL Waits and Deadlocks
    ;
    
    exec dbms_stats.gather_table_stats(null, 'TEST_TAB', cascade => TRUE);
    
    select block, to_char(block, 'xxxxxxxx') block_hex, count (*) cnt 
      from (select rowid rid, 
                   dbms_rowid.rowid_object (rowid) object, 
                   dbms_rowid.rowid_relative_fno (rowid) fil, 
                   dbms_rowid.rowid_block_number (rowid) block, 
                   dbms_rowid.rowid_row_number (rowid) ro, 
                   t.* 
              from test_tab t 
             order by fil, block, ro) 
     group by block 
     order by block, cnt desc;
    
         BLOCK BLOCK_HEX        CNT
      -------- --------- ----------
       3070331    2ed97b          5
       3070332    2ed97c          5
       3070333    2ed97d          5
       3070334    2ed97e          5
       3070335    2ed97f          5
       3072640    2ee280          5
       3072641    2ee281          5
       3072642    2ee282          5
       3072643    2ee283          5
       3072644    2ee284          5
       3072645    2ee285          5
       3072646    2ee286          5
       3072647    2ee287          5
       3072649    2ee289          5
       3072650    2ee28a          5
       3072651    2ee28b          5
       3072652    2ee28c          5
       3072653    2ee28d          5
       3072654    2ee28e          5
       3072655    2ee28f          5
     20 rows selected. 
      
    select object_name, object_id from dba_objects where object_name = 'TEST_TAB#U1';  
    
      OBJECT_NAME  OBJECT_ID
      ----------- ----------
      TEST_TAB#U1    2260907
    
    select block#, to_char(block#, 'xxxxxxxx') block#_hex
          ,count(*) rows_per_block
    from (
      select dbms_rowid.rowid_block_number(sys_op_lbid (2260907, 'L', t.rowid))  block#
            ,t.*
        from test_tab t
       where id is not null
    )
    group by block#
    order by block#; 
    
        BLOCK# BLOCK#_HE ROWS_PER_BLOCK
      -------- --------- --------------
       3072660    2ee294              5
       3072661    2ee295              5
       3072662    2ee296              5
       3072663    2ee297              5
       3072664    2ee298              5
       3072665    2ee299              5
       3072666    2ee29a              5
       3072667    2ee29b              5
       3072668    2ee29c              5
       3072669    2ee29d              5
       3072670    2ee29e              5
       3072671    2ee29f              5
       3072673    2ee2a1              5
       3072674    2ee2a2              5
       3072675    2ee2a3              5
       3072676    2ee2a4              5
       3072677    2ee2a5              5
       3072678    2ee2a6              5
       3072679    2ee2a7              5
       3072680    2ee2a8              5
     20 rows selected 
    


    2. Buffer Read Access Path Tests


    Trace Event 10200 shows all consistent read Blocks and their sequence. We will demonstrate the number of consistent read Blocks with different access path.

    At first, display Table and Index information:
    
    column object_name format a12
    column object_id_hex format a20
    column data_object_id_hex format a20
    
    select object_name, object_id, data_object_id, 
           to_char(object_id, 'xxxxxxxx') object_id_hex, to_char(data_object_id, 'xxxxxxxx') data_object_id_hex
     from dba_objects where object_name in ('TEST_TAB', 'TEST_TAB#U1');
    
    
     OBJECT_NAME  OBJECT_ID  DATA_OBJECT_ID  OBJECT_ID_HEX  DATA_OBJECT_ID_HEX
     ----------- ---------- --------------- -------------- -------------------
     TEST_TAB       2260905         2260906         227fa9              227faa
     TEST_TAB#U1    2260907         2260907         227fab              227fab
    


    2.1 USER ROWID without using Index


    First collect ROWID for test, then make the test:
    
    select /*+ user_rowid_no_index */ t.*, rowid rid from test_tab t 
    where id in (1, 2, 3, 4, 5, 6, 7) order by t.id;
    
      ID VAL        RID
     --- ---------- ------------------
       1 ABCXXXXXXX AAIn+qAAAAALtl7AAA
       2 ABCXXXXXXX AAIn+qAAAAALtl7AAB
       3 ABCXXXXXXX AAIn+qAAAAALtl7AAC
       4 ABCXXXXXXX AAIn+qAAAAALtl7AAD
       5 ABCXXXXXXX AAIn+qAAAAALtl7AAE
       6 ABCXXXXXXX AAIn+qAAAAALtl8AAA
       7 ABCXXXXXXX AAIn+qAAAAALtl8AAB
    
    alter session set events '10200 trace name context forever,level 10';
    
    select /*+ user_rowid_no_index */ * from test_tab t 
    where id in (1, 2, 3, 4, 5, 6, 7)
      and rowid in 
          ('AAIn+qAAAAALtl7AAA'
          ,'AAIn+qAAAAALtl7AAB'
          ,'AAIn+qAAAAALtl7AAC'
          ,'AAIn+qAAAAALtl7AAD'
          ,'AAIn+qAAAAALtl7AAE'
          ,'AAIn+qAAAAALtl8AAA'
          ,'AAIn+qAAAAALtl8AAB');
    
    alter session set events '10200 trace name context off';
    
    Rows (1st) Rows (avg) Rows (max)  Row Source Operation (0 Branch Block, 0 Leaf Blocks, 3 Table Blocks)
    ---------- ---------- ----------  ---------------------------------------------------
             7          7          7  INLIST ITERATOR  (cr=3 pr=0 pw=0 time=130 us)
             7          7          7   TABLE ACCESS BY USER ROWID TEST_TAB (cr=3 pr=0 pw=0 time=261 us cost=1 size=14 card=1)
    
    ktrget2(): started for block  
     <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
     <0x07cf : 0x002ed97c> objd: 0x00227faa       <-- Table  Block 2
    


    2.2 INDEX RANGE SCAN with INDEX ROWID BATCHED


    
    alter session set events '10200 trace name context forever,level 10';
    
    select /*+ index_range_scan */ * from test_tab t where id between 1 and 7;
    
    alter session set events '10200 trace name context off';
    
    Row Source Operation (1 Branch Block, 3 Leaf Blocks, 3 Table Blocks)
    ---------------------------------------------------
    TABLE ACCESS BY INDEX ROWID BATCHED TEST_TAB (cr=7 pr=0 pw=0 time=210 us cost=3 size=98 card=7)
     INDEX RANGE SCAN TEST_TAB#U1 (cr=4 pr=0 pw=0 time=552 us cost=2 size=0 card=7)(object id 2260907)
    
    ktrgtc2(): started for block
      <0x07cf : 0x002ee293> objd: 0x00227fab   <-- Branch Block 
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
      <0x07cf : 0x002ee295> objd: 0x00227fab     <-- Leaf  Block 2
      <0x07cf : 0x002ed97c> objd: 0x00227faa       <-- Table  Block 2
    


    2.3 INDEX UNIQUE SCAN


    
    select /*+ index_uniqe_scan */ * from test_tab t where id in (1, 2, 3, 4, 5, 6, 7);
    
    Row Source Operation (2 Branch Block, 8 Leaf Blocks, 3 Table Blocks)
    ---------------------------------------------------
    INLIST ITERATOR  (cr=13 pr=0 pw=0 time=218 us)
     TABLE ACCESS BY INDEX ROWID TEST_TAB (cr=13 pr=0 pw=0 time=812 us cost=3 size=98 card=7)
      INDEX UNIQUE SCAN TEST_TAB#U1 (cr=10 pr=0 pw=0 time=585 us cost=2 size=0 card=7)(object id 2260907)
      
    ktrgtc2(): started for block  
      <0x07cf : 0x002ee293> objd: 0x00227fab   <-- Branch Block
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
      <0x07cf : 0x002ee293> objd: 0x00227fab   <-- Branch Block
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ed97b> objd: 0x00227faa       <-- Table  Block 1
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ee294> objd: 0x00227fab     <-- Leaf  Block 1
      <0x07cf : 0x002ee295> objd: 0x00227fab     <-- Leaf  Block 2
      <0x07cf : 0x002ed97c> objd: 0x00227faa       <-- Table  Block 2
      <0x07cf : 0x002ee295> objd: 0x00227fab     <-- Leaf  Block 2
    


    2.4 INDEX ROWID GET


    We can construct a pipelined functions to force INDEX ROWID GETs in a given frequency.
    
    create type number_tab as table of number;
    /
    
    create or replace function id_pipelined(p_cnt number, p_sleep_seconds number := 0) 
    return number_tab pipelined as
    begin
      for i in 1..p_cnt loop
        if p_sleep_seconds > 0 then
          dbms_lock.sleep(p_sleep_seconds);  -- purposed delay, specially before producing first row
        end if;
        pipe row(i);
      end loop;
      return;
      exception
        when no_data_needed then
          raise;
        when others then
          dbms_output.put_line('others Handler');
          raise;
    end;
    /
    
    select /*+ leading(c) cardinality(c 10) indx(t test_tab#u1) index_rowid_pipelined */ * 
      from table(cast (id_pipelined(7, 1.3) as number_tab)) c, test_tab t
     where c.column_value = t.id;
      
    Row Source Operation (2 Branch Block, 8 Leaf Blocks, 7 Table Blocks)
    ---------------------------------------------------
    NESTED LOOPS  (cr=17 pr=0 pw=0 time=253 us cost=35 size=160 card=10)
     NESTED LOOPS  (cr=10 pr=0 pw=0 time=7802413 us cost=35 size=160 card=10)
      COLLECTION ITERATOR PICKLER FETCH ID_PIPELINED (cr=0 pr=0 pw=0 time=7801325 us cost=30 size=20 card=10)
      INDEX UNIQUE SCAN TEST_TAB#U1 (cr=10 pr=0 pw=0 time=823 us cost=1 size=0 card=1)(object id 2260907)
     TABLE ACCESS BY INDEX ROWID TEST_TAB (cr=7 pr=0 pw=0 time=395 us cost=1 size=14 card=1)
      
    ktrgtc2(): started for block 
     <0x07cf : 0x002ee293> objd: 0x00227fab  <-- Branch Block 
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa      <-- Table  Block 1
     <0x07cf : 0x002ee293> objd: 0x00227fab  <-- Branch Block 
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa      <-- Table  Block 1
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa      <-- Table  Block 1
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa      <-- Table  Block 1
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ed97b> objd: 0x00227faa      <-- Table  Block 1
     <0x07cf : 0x002ee294> objd: 0x00227fab    <-- Leaf  Block 1
     <0x07cf : 0x002ee295> objd: 0x00227fab    <-- Leaf  Block 2
     <0x07cf : 0x002ed97c> objd: 0x00227faa      <-- Table  Block 2
     <0x07cf : 0x002ee295> objd: 0x00227fab    <-- Leaf  Block 2
     <0x07cf : 0x002ed97c> objd: 0x00227faa      <-- Table  Block 2
    


    3. Discussions


    3.1 Access Path


    Above 4 tests showed:
      USER ROWID:         3 Blocks (0 Branch Block, 0 Leaf Blocks, 3 Table Blocks)
      INDEX RANGE SCAN:   7 Blocks (1 Branch Block, 3 Leaf Blocks, 3 Table Blocks)
      INDEX UNIQUE SCAN: 13 Blocks (2 Branch Block, 8 Leaf Blocks, 3 Table Blocks)
      INDEX ROWID GET:   17 Blocks (2 Branch Block, 8 Leaf Blocks, 7 Table Blocks)
    
    The fourth test is the most expensive one because pipelined function is a row-by-row processing. The consumer starts processing only after producer produces it (the principle of "pipelined"). In the above example, consumer waits for "p_sleep_seconds" when receiving first row from producer. In the real applications, during "p_sleep_seconds", any modifications can occur, and in order to maintain Read Consistency from query start SCN, Oracle should use UNDO to clone Buffer, hence, 'latch: cache buffers chains' (see Demo in next Section).

    In the aspect of Disk Read, row-by-row access is fulfilled by "db file sequential read" instead of more efficient multiblock (batch) "db file scattered read" or "db file parallel read" (to be discussed in next Blog: Oracle DB File Read Access Path and Cost).

    Oracle 12c Database Data Cartridge Developer's Guide Section: 13 Using Pipelined and Parallel Table Functions wrote:

    Rows from a collection returned by a table function can also be pipelined; this means that they are iteratively returned as they are produced, instead of being returned in a single batch after all processing of the table function's input is completed.

    Pipelining enables a table function to return rows faster and can reduce the memory required to cache a table function's results.


    It looks like that Pipelined function is a reflection of consuming row-by-row produced data as a memory (PGA) optimization of collection processing if intermediate result can be aggregated and consumed rows can be forgotten.

    By the way, Pipelined table functions can be implemented in two ways: native PL/SQL and interface. In the above example, we use the native PL/SQL approach.

    Oracle is a RDBMS based on relational model, where "relation" means "tabular", in mathematics, a set of objects (rows).

    Blog Ask TOM: Slow processing on database using merge declared:
        SQL is about sets, the best way to make SQL perform is to think SETS, not procedurally

    The first test with "USER ROWID" is the cheapest fetching approach. In real application, it can be implemented by first caching ROWID of previously fetched ROWs, and subsequently, make the direct table fetch, and therefore bypass INDEX access. (See: Blog: Hot Block Identification and latch: cache buffers chains Section: "5. Index Hot Block Application Fix")


    3.2 'latch: cache buffers chains' on Index Blocks


    Index is made of a part of table columns. Usually it is more condensed than table, i.e, each Branch/Leaf Block can contain more rows than table Block. In other words, the same index Block is more frequently touched than table Block, hence more prone to 'latch: cache buffers chains' when multi sessions concurrently access it.


    4. 'latch: cache buffers chains' Demo


    
    create or replace procedure index_rowid_pipelined_get(p_cnt number, p_sleep_seconds number := 0) as
    begin
      for c in (
        select /*+ leading(c) cardinality(c 10) indx(t test_tab#u1) index_rowid_pipelined */ * 
        from table(cast (id_pipelined(p_cnt, p_sleep_seconds) as number_tab)) c, test_tab t
        where c.column_value = t.id)
      loop
        null;
      end loop;
    end;
    /
    
    create or replace procedure index_rowid_pipelined_get_jobs(p_job_cnt number, p_cnt number, p_sleep_seconds number := 0) as
      l_job_id pls_integer;
    begin
      for i in 1.. p_job_cnt loop
        dbms_job.submit(l_job_id, 'begin while true loop index_rowid_pipelined_get('||p_cnt||', '||p_sleep_seconds||'); end loop; end;');
      end loop;
      commit;
    end;    
    /
    
    Launch 6 Jobs on a UNIX with 6 physical Processors:
    
    exec index_rowid_pipelined_get_jobs(6, 90, 0);
    
    All Blocks in V$BH have status 'xcur'.

    When modifying Table and no commit/rollback:
    
    update test_tab set id = -id where abs(id) between 3 and 12;   -- Without commit/rollback
    
    We can immediately observe 'latch: cache buffers chains' or 'buffer busy waits':
    
    select * from v$session_wait where event in ('latch: cache buffers chains') order by event, sid;
    
    select * from  v$session_wait_history where event in ('latch: cache buffers chains') order by event, sid;
    
    In case of 'latch: cache buffers chains', one guess is that when
      V$SESSION_WAIT.state = 'WAITED SHORT TIME' (less than one centisecond), V$LATCH.Spin_Gets increased
      V$SESSION_WAIT.state = 'WAITED KNOWN TIME' (more than one centisecond), V$LATCH.Sleeps increased
      no entries in V$SESSION_WAIT, but V$BH.status='cr' Block Read,          V$LATCH.immediate_gets increased
    
    but it is not clear how to confirm it.

    By querying v$bh, we can see Consistent Read Copy of Blocks (cloned) marked as 'cr':
    
    select d.object_name, data_object_id, b.status, b.* 
      from v$bh b, dba_objects d
     where b.objd = d.data_object_id
       and object_name in ('TEST_TAB', 'TEST_TAB#U1')
       --and b.status not in ('free')
       and b.status in ('xcur', 'cr')
    order by b.status, object_name, block#;
    
    In this case, Event 10201 ("consistent read undo application") can be used to trace CR Reads constructed by applying UNDO Blocks.

    To stop all Jobs, execute following "clean_jobs".

    
    -------------- clean_jobs --------------
    
    create or replace procedure clean_jobs as
    begin
      for c in (select job from dba_jobs) loop
        begin
           dbms_job.remove (c.job);
        exception when others then null;
        end;
        commit;
      end loop;
    
      for c in (select d.job, d.sid, (select serial# from v$session where sid = d.sid) ser 
                  from dba_jobs_running d) loop
        begin
          execute immediate
                 'alter system kill session '''|| c.sid|| ',' || c.ser|| ''' immediate';
          dbms_job.remove (c.job);
        exception when others then null;
        end;
        commit;
      end loop;
      
      -- select * from dba_jobs;
      -- select * from dba_jobs_running;
    end;
    /
    
    -- exec clean_jobs; -- stop all jobs
    

    Tuesday, October 31, 2017

    ORA-04025 library object lock on DATABASE Truncate Triggers

    In this Blog, we will discuss ORA-04025 Errors caused by DATABASE Truncate Triggers.

    Note 1. All tests are done in Oracle 12.1.0.2.0 on AIX, Solaris, Linux.
    Note 2. ORA-04025 Maximum Allowed Library Object Lock Allocated is also discussed in Blog:
                 ORA-04025 library object lock on AQ queue and SQL cursor


    1. DATABASE Truncate Triggers


    In Oracle 12c, there exist two DATABASE Triggers:
        XDB_PI_TRIG
        AW_TRUNC_TRG
    
    They can be discovered by:
    
    create or replace function getlong(p_trigger_name varchar2) return varchar2 as 
      l_data long; 
    begin 
      select trigger_body into l_data from dba_triggers where trigger_name = p_trigger_name; 
      return substr(l_data, 1, 4000);
    end; 
    /
    
    column owner format            a6
    column trigger_name format     a12
    column trigger_type format     a12
    column triggering_event format a15
    column status format           a8
    column description format      a40
    set wrap off
    
    SQL> select owner, trigger_name, trigger_type, triggering_event, status, description 
             --,getlong(trigger_name) trigger_body
         from dba_triggers where triggering_event like '%TRUNCATE%';
    
      OWNER  TRIGGER_NAME TRIGGER_TYPE TRIGGERING_EVEN STATUS   DESCRIPTION
      ------ ------------ ------------ --------------- -------- ----------------------------------------
      SYS    XDB_PI_TRIG  BEFORE EVENT DROP OR TRUNCAT ENABLED  sys.xdb_pi_trig
      SYS    AW_TRUNC_TRG AFTER EVENT  TRUNCATE        ENABLED  aw_trunc_trg AFTER TRUNCATE ON DATABASE
    
    
    set wrap on
    
    SQL> select getlong(trigger_name) trigger_body from dba_triggers where trigger_name = 'XDB_PI_TRIG';
    
    BEGIN
      BEGIN
        IF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_CONTENTS)) THEN
          xdb.XDB_PITRIG_PKG.pitrig_truncate(sys.dictionary_obj_owner, sys.dictionary_obj_name);
        ELSIF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_RESMETADATA)) THEN
          xdb.XDB_PITRIG_PKG.pitrig_dropmetadata(sys.dictionary_obj_owner, sys.dictionary_obj_name);
        END IF;
      EXCEPTION
        WHEN OTHERS THEN
         null;
      END;
    END;
    
    SQL> select getlong(trigger_name) trigger_body from dba_triggers where trigger_name = 'AW_TRUNC_TRG';
    
    BEGIN
      aw_trunc_proc(ora_dict_obj_type, ora_dict_obj_name, ora_dict_obj_owner);
    END;
    
    
    Note: ora_dict_obj_owner, ora_dict_obj_name, ora_dict_obj_type are published Attributes (public synonyms) of System Events.
          (replacing functions of earlier releases: sys.dictionary_obj_owner, sys.dictionary_obj_name, sys.dictionary_obj_type 
           defined in dbms_standard).
    
    


    2. Test


    Run following test (see appended Test Code):
    
    alter system set session_cached_cursors=2000 scope=spfile;
    alter system set processes=1100 scope=spfile;
    alter system set job_queue_processes=1000 scope=spfile;
    alter system set "_session_cached_instantiations"=1001 scope=spfile;
    
    -- restart DB
    SQL> startup force
    
    -- start Test by launching 66 Jobs (66 > floor(65535/1000)=65)
    SQL> exec gtt_truncate_job(66, 1e9);
    
    --after Test, clean jobs
    -- exec clean_jobs;
    
    During test, monitor v$db_object_cache.LOCKS for both triggers, we can see their increasing:
    
    column name format a80
    set wrap off
    
    SQL> select hash_value, locks, pins, locked_total, pinned_total, name  --v.*
           from v$db_object_cache v
          where locks > 1000
            and (hash_value in (2219505151, 679830350) 
             or (name like '%xdb.DBMS_XDBZ.%' or name like '%aw_trunc_proc%')); 
    
      HASH_VALUE  LOCKS  PINS LOCKED_TOTAL PINNED_TOTAL NAME
      ---------- ------ ----- ------------ ------------ --------------------------------------------------------------------------------
      2219505151  64141     0      1164253      1164234 BEGIN   BEGIN     IF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionar
      2219505151  64238     0      1164365            1 BEGIN   BEGIN     IF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionar
       679830350  64506     0      1164192      1164167 BEGIN   aw_trunc_proc(ora_dict_obj_type, ora_dict_obj_name, ora_dict_obj_owner);
       679830350  64602     0      1164289            1 BEGIN   aw_trunc_proc(ora_dict_obj_type, ora_dict_obj_name, ora_dict_obj_owner);
    
    After about 10 minutes, LOCKS increased closing to 65535 (66 sessions, each of which allocated about 1000 LOCKS).

    database alert.log shows ORA-04025 Errors caused by two triggers:
    
    ORA-04025: maximum allowed library object lock allocated for BEGIN
      aw_trunc_proc(ora_dict_obj_type, ora_dict_obj_name, ora_dict_obj_owner);
    END;
    
    ORA-04025: maximum allowed library object lock allocated for BEGIN
      BEGIN
        IF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_CONTENTS)) THEN
          xdb.XDB_PITRIG_PKG.pitrig_truncate(sys.dictionary_obj_owner, sys.dictionary_obj_name);
        ELSIF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_RESMETADATA)) THEN
          xdb.XDB_PITRIG_PKG.pitrig_dropmetadata(sys.dictionary_obj_owner, sys.dictionary_obj_name);
        END IF;
      EXCEPTION
        WHEN OTHERS THEN
         null;
      END;
    END;
    
    Both cursors can also be found by:
    
    column sql_text format a100
    set wrap off
    
    SQL > select sql_id, hash_value, executions, parse_calls, sql_text from v$sqlarea v 
          where sql_id in ('ct6c4h224pxgz', 'gx87sann8asuf') or hash_value in (2219505151, 679830350);
    
      SQL_ID        HASH_VALUE EXECUTIONS PARSE_CALLS SQL_TEXT
      ------------- ---------- ---------- ----------- --------------------------------------------------
      ct6c4h224pxgz 2219505151   13537973    13623969 BEGIN   BEGIN     IF (sys.is_vpd_enabled(sys.dicti
      gx87sann8asuf  679830350   13649026    13603140 BEGIN   aw_trunc_proc(ora_dict_obj_type, ora_dict_
    
    Blog: ORA-04025 library object lock on AQ queue and SQL cursor showed that Maximum Allowed Library Object Lock is hard limited to 65535. That is why at least 66 Jobs (66 > floor(65535/1000)=65) have to be launched to provoke ORA-04025 Error.

    Note that in the above test, we set
      session_cached_cursors=2000
    
    because of two DATABASE Truncate Triggers, each of both can acquire about 1000 LOCKs.

    As a justification, even we start one single session, LOCK_CNT is around 1000.
    
    SQL > exec clean_jobs;
     
    SQL > exec gtt_truncate_job(1, 1e9);
    
    SQL > with sq as
            (select /*+ materialize */ sid, program, saddr, event
                ,to_char(p1, 'xxxxxxxxxxxxxxx') p1, p1text
                ,to_char(p2, 'xxxxxxxxxxxxxxx') p2, p2text
                ,to_char(p3, 'xxxxxxxxxxxxxxx') p3, p3text
             from v$session
            where program like '%(J%')
          select o.kglnahsh, s.sid, holding_user_session, object_handle, l.type, mode_held, mode_requested,
                 count(*) lock_cnt     -- it goes up to about 1000
            from v$libcache_locks l, x$kglob o, sq s
           where 1= 1
             and o.kglnahsh in (2219505151, 679830350)     
             and l.object_handle = o.kglhdadr
             and l.holding_user_session = s.saddr
          group by o.kglnahsh, s.sid, holding_user_session, object_handle, l.type, mode_held, mode_requested
          order by lock_cnt desc, s.sid, holding_user_session, l.type, mode_held, mode_requested;
      
        KGLNAHSH  SID HOLDING_USER_SES OBJECT_HANDLE    TYPE  MODE_HELD MODE_REQUESTED   LOCK_CNT
      ---------- ---- ---------------- ---------------- ---- ---------- -------------- ----------
      2219505151  152 070001010D31BD08 07000100EDC018C8 LOCK          1              0       1012
      2219505151  152 070001010D31BD08 07000100ED2F4110 LOCK          1              0       1011
       679830350  152 070001010D31BD08 07000100ECDBA460 LOCK          1              0       1009
       679830350  152 070001010D31BD08 07000100ECDBBF28 LOCK          1              0       1009
    


    3. "library cache: mutex X" and "cursor: pin X"


    During test, heavy "library cache: mutex X" and "cursor: pin X" are observed on GTT truncate statement:
    
    truncate table my_gtt
     
    SQL > select mutex_identifier, mutex_type, count(*) cnt from v$mutex_sleep_history 
          group by mutex_identifier, mutex_type 
          order by cnt desc;
    
      MUTEX_IDENTIFIER MUTEX_TYPE      CNT
      ---------------- -------------- ----
             167539573 Library Cache   278
             167539573 Cursor Pin       81
       
    SQL > select hash_value, locks, pins, locked_total, pinned_total, invalidations, name 
          from v$db_object_cache where hash_value in (167539573);
    
      HASH_VALUE  LOCKS  PINS LOCKED_TOTAL PINNED_TOTAL INVALIDATIONS NAME
      ---------- ------ ----- ------------ ------------ ------------- ---------------------
       167539573      1     0        14295        26690         13002 truncate table my_gtt
       167539573      2     0        14309        26510         13026 truncate table my_gtt
       167539573      1     0        14134        26218         12874 truncate table my_gtt
       ......
     
    SQL > select sql_id, hash_value, executions, parse_calls, invalidations, sql_text 
          from v$sqlarea v where sql_id in ('1jh104s4zswvp') or hash_value in (167539573); 
     
      SQL_ID        HASH_VALUE EXECUTIONS PARSE_CALLS INVALIDATIONS SQL_TEXT
      ------------- ---------- ---------- ----------- ------------- ---------------------
      1jh104s4zswvp  167539573         86          90       1325129 truncate table my_gtt
    
    It seems that "library cache: mutex X" is caused by dynamic SQL statement "EXECUTE IMMEDIATE" because GTT truncate is executed by:
        execute immediate 'truncate table my_gtt'
    
    and both above DATABASE Truncate Triggers are probably also called by "EXECUTE IMMEDIATE" since v$db_object_cache.NAME and v$sqlarea.SQL_TEXT are formatted in lower and upper mixed cases.

    See Blogs:
        Divide and conquer the true mutex contention
        "library cache: mutex X" and Application Context
    for more discussions.


    4. XDB_PI_TRIG Trigger Components and Cost


    Run GTT truncate with SQL Trace 10046:
    
    alter trigger sys.AW_TRUNC_TRG disable;
    alter trigger sys.XDB_PI_TRIG enable;
    
    SQL > alter system flush shared_pool; 
    SQL > alter session set events '10046 trace name context forever, level 12';
    SQL > exec gtt_truncate(1);
    SQL > alter session set events '10046 trace name context off';
    
    We can see all the TOP statements:
    
    --SQL_ID: 86kwhy1f0bttn 
    SELECT /*+ ALL_ROWS */
          COUNT (*)
      FROM DBA_POLICIES V
     WHERE     V.OBJECT_OWNER = :B3
           AND V.OBJECT_NAME = :B2
           AND (V.POLICY_NAME LIKE '%xdbrls%' OR V.POLICY_NAME LIKE '%$xd_%')
           AND V.FUNCTION = :B1
           
    --SQL_ID: 1jh104s4zswvp 
    truncate table my_gtt
           
    --SQL_ID: 1u4r80k42x8jj 
    select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 1001511 and bitand(FLAGS, 128)=0
    
    --SQL_ID: au7qd3c3yuk1u 
    LOCK TABLE "MY_GTT" IN EXCLUSIVE MODE WAIT 5
    
    --SQL_ID: 9ff116syaf279 
    SELECT TO_NUMBER(XS_SYS_CONTEXT('XS$SESSION', 'SESSION_XS_USER_GUID')) FROM DUAL
    
    --SQL_ID: au7qd3c3yuk1u 
    LOCK TABLE "MY_GTT" IN EXCLUSIVE MODE WAIT 5
    
    In fact, our GTT table: my_gtt is irrelevant to XDB:
    
    SELECT /*+ ALL_ROWS */
          *
      FROM DBA_POLICIES V
     WHERE     V.OBJECT_OWNER = 'XDB'
         --AND V.OBJECT_NAME = 'MY_GTT'
           AND (V.POLICY_NAME LIKE '%xdbrls%' OR V.POLICY_NAME LIKE '%$xd_%')
           AND V.FUNCTION in ('CHECKPRIVRLS_SELECTPF', 'CHECKPRIVRLS_SELECTPROPF');         
    


    5. Query against SYS_FBA_TRACKEDTABLES


    We also noticed one query on SYS_FBA_TRACKEDTABLES having TOP Executions and Parse Calls. Both number are very close, which means that each Execution requires a Parse Call.
    
    SQL > select sql_id, hash_value, executions, parse_calls, sql_text from v$sqlarea v 
    where sql_id in ('1u4r80k42x8jj') or hash_value in (2284757553);
    
      SQL_ID        HASH_VALUE EXECUTIONS PARSE_CALLS SQL_TEXT
      ------------- ---------- ---------- ----------- -------------------------------------------------------------------------------------------
      1u4r80k42x8jj 2284757553   35663414    35536924 select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 1001511 and bitand(FLAGS, 128)=0
    
    -- where 1001511 is MY_GT
    
    SQL > select object_name, object_id, object_type from dba_objects where object_id = 1001511;
    
      OBJECT_NAME   OBJECT_ID  OBJECT_TYPE
      -----------  ---------- ------------
      MY_GTT          1001511        TABLE
    
    Searching in Oracle MOS, it looks like:
         Bug 15931756 - ORA-4031 / Queries against SYS_FBA_TRACKEDTABLES not shared (do not use binds) (Doc ID 15931756.8)


    6. Disable/Drop Triggers


    Trigger XDB_PI_TRIG is applied for Oracle XML Database (XDB), and AW_TRUNC_TRG for OLAP Analytic Workspaces(AW). If they are not installed/used, one can disable/drop them. These can be checked by:
    
    select comp_name, version, status from dba_registry order by 1;
    select * from v$option where parameter = 'OLAP';  
    select * from dba_feature_usage_statistics where name like '%XDB%' or name like '%OLAP%';
    select * from dba_aws;
    
    (Note: XDB_PI_TRIG trigger moved from schema XDB to SYS with PSU 11.1.0.7.16)  
    
    It is not clean why AW_TRUNC_TRG got installed although we do not buy/pay the license. Besides, the table name MY_GTT is not like 'AW$_%'.


    7. Boundary Value and Cursor Leak ?


    Set session_cached_cursors to Boundary Value: 65535:
    
    alter system set session_cached_cursors=65535 scope=spfile;
    alter system set "_session_cached_instantiations"=1001 scope=spfile;
    alter system set open_cursors=400 scope=spfile;
    alter system set cursor_sharing=exact scope=spfile;
    
    -- enable only one single truncate trigger
    alter trigger sys.AW_TRUNC_TRG disable;  
    alter trigger sys.XDB_PI_TRIG  enable;
    
    -- Restart DB
    --SQL> startup force
    
    Open one Sqlplus session and run (it takes about 10 minutes):
    
    SQL (567) > exec gtt_truncate(65535);
    
    Immediately after its finished, run following queries about Library Cache and Cursor, and look the output:
    
    set lines 400
    column name format a50
    set wrap off
    
    select hash_value, locks, pins, locked_total, pinned_total, executions, child_latch, name
      from v$db_object_cache v
     where (hash_value in (2219505151) or (name like 'BEGIN%xdb.DBMS_XDBZ.%')) 
       and locks > 100; 
       
      HASH_VALUE  LOCKS  PINS LOCKED_TOTAL PINNED_TOTAL EXECUTIONS CHILD_LATCH NAME
      ---------- ------ ----- ------------ ------------ ---------- ----------- --------------------------------------------------
      2219505151  65532     0       131070       131072      65535           0 BEGIN   BEGIN     IF (sys.is_vpd_enabled(sys.dicti
      2219505151  65533     0       131073            1      65535       62975 BEGIN   BEGIN     IF (sys.is_vpd_enabled(sys.dicti
    
    
    select hash_value, sql_id, child_number, users_opening, executions, parse_calls, invalidations
    from  v$sql v
    where (sql_id in ('ct6c4h224pxgz') or hash_value in (2219505151))
      and users_opening > 100;   
      
      HASH_VALUE SQL_ID        CHILD_NUMBER USERS_OPENING EXECUTIONS PARSE_CALLS INVALIDATIONS
      ---------- ------------- ------------ ------------- ---------- ----------- -------------
      2219505151 ct6c4h224pxgz            1         65531      65535       65535             1  
    
    
    select count(*) from v$open_cursor
    where (sql_id in ('ct6c4h224pxgz') or hash_value in (2219505151))
      and sid in (567);
      
       COUNT(*)
      ---------
          65530 
        
    select * from v$open_cursor
    where (sql_id in ('ct6c4h224pxgz') or hash_value in (2219505151))
      and sid in (567) 
      and rownum <= 3;  
    
       SID HASH_VALUE SQL_ID        SQL_TEXT                              LAST_SQL SQL_EXEC CURSOR_TYPE                     CHILD_ADDRESS   
      ---- ---------- ------------- ------------------------------------- -------- -------- ------------------------------- ----------------
       567 2219505151 ct6c4h224pxgz BEGIN   BEGIN     IF (sys.is_vpd_enab                   DICTIONARY LOOKUP CURSOR CACHED 000000017636BA68
       567 2219505151 ct6c4h224pxgz BEGIN   BEGIN     IF (sys.is_vpd_enab                   DICTIONARY LOOKUP CURSOR CACHED 000000017636BA68
       567 2219505151 ct6c4h224pxgz BEGIN   BEGIN     IF (sys.is_vpd_enab                   DICTIONARY LOOKUP CURSOR CACHED 000000017636BA68
    
    
    select s.program, s.sid, n.name p_name, t.value
    from v$session s, v$sesstat t, v$statname n 
    where s.sid=t.sid and n.statistic# = t.statistic# 
      and name = 'session cursor cache count'
      and s.sid in (567);
      
      PROGRAM       SID  P_NAME                      VALUE
      ------------ ---- --------------------------- ------
      sqlplus.exe   567  session cursor cache count  65535
      
          
    select sql_id, child_number from v$sql_shared_cursor
    where sql_id in ('ct6c4h224pxgz');
    
      SQL_ID        CHILD_NUMBER
      ------------- ------------
      ct6c4h224pxgz            0
      ct6c4h224pxgz            1      
    
    The above result shows:
      1. v$db_object_cache.LOCKS is close to 65535
      2. v$sql.USERS_OPENING is close to 65535
      3. v$open_cursor count is close to 65535
      4. 'session cursor cache count' is 65535
    
    Contrary to Blog: ORA-04025 library object lock on AQ queue and SQL cursor
      1. SESSION_CACHED_CURSORS (65535) sets hard limit of LIBRARY OBJECT LOCK, 
         instead of "_session_cached_instantiations".
      2. OPEN_CURSORS has no effect. 
    
    v$open_cursor shows that one single session for one single cursor has 65535 CURSORS (opened and parsed, or cached), all marked with CURSOR_TYPE: "DICTIONARY LOOKUP CURSOR CACHED".
    65535 is clearly over open_cursors=400, and according to Oracle Doc:
        OPEN_CURSORS specifies the maximum number of open cursors (handles to private SQL areas) a session can have at once.
    It is not clear if this a Cursor Leak.

    (MOS: Bug 25477055: INCORRECT CURSOR_TYPE IN V$OPEN_CURSOR FOR SESSION CACHED CURSOR said:
        "DICTIONARY LOOKUP CURSOR CACHED" is misreported, it should be "SESSION CURSOR CACHED").

    Immediately after above gtt_truncate(65535) terminated, run following statement in another session, it will raise ORA-04025:
    
    SQL (1124)> exec gtt_truncate(100);
    
      ORA-04025: maximum allowed library object lock allocated for BEGIN
      BEGIN
      IF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_CONTENTS)) THEN
      xdb.XDB_PITRIG_PKG.pitrig_truncate(sys.dictionary_obj_owner, sys.dictionary_obj_name);
      ELSIF (sys.is_vpd_enabled(sys.dictionary_obj_owner, sys.dictionary_obj_name, xdb.DBMS_XDBZ.IS_ENABLED_RESMETADATA)) THEN
      xdb.XDB_PITRIG_PKG.pitrig_dropmetadata(sys.dictionary_obj_owner, sys.dictionary_obj_name);
      END IF;
      EXCEPTION
      WHEN OTHERS THEN
      null;
      END;
      END;  
    


    8. Simple Truncate Trigger Reproducing


    We will use a (almost) empty Truncate Trigger to demonstrate that ORA-04025 is caused by the existence of Truncate Trigger, but irrelevant to its substances.

    Disable (or Drop) both Oracle Truncate Triggers:
    
    alter trigger sys.AW_TRUNC_TRG disable;
    alter trigger sys.XDB_PI_TRIG  disable;
    
    Create a new trivial Trigger:
    
    create or replace trigger SYS.KSUN_TEST_TRIG
    before drop or truncate ON database 
      BEGIN
        IF 1=2 THEN
          null;
        END IF;
      END;
    /
    
    Configure session_cached_cursors to maximum value:
    
    alter system set session_cached_cursors=65535 scope=spfile;
    alter system set "_session_cached_instantiations"=1001 scope=spfile;
    alter system set open_cursors=400 scope=spfile;
    alter system set cursor_sharing=exact scope=spfile;
    
    Run test:
    
    SQL> exec gtt_truncate_job(2, 1e9);
    
    Monitor v$db_object_cache.LOCKS:
    
    column name format a80
    set wrap off
    
    SQL> select hash_value, locks, pins, locked_total, pinned_total, executions, name  --v.*
           from v$db_object_cache v
          where locks > 1000
            and (name like '%1=2%' or name like '%KSUN_TEST_TRIG%');
            
      HASH_VALUE  LOCKS  PINS LOCKED_TOTAL PINNED_TOTAL EXECUTIONS NAME
      ---------- ------ ----- ------------ ------------ ---------- -----------------------------------------------------
      2564248212  65514     0        65516        65517      65516 BEGIN     IF 1=2 THEN       null;     END IF;   END;
      2564248212  65514     0        65516            1      65515 BEGIN     IF 1=2 THEN       null;     END IF;   END;        
    
    and alert.log shows:
            
    ORA-04025: maximum allowed library object lock allocated for BEGIN
        IF 1=2 THEN
          null;
        END IF;
      END;
    


    9. Test Code


    
    create global temporary table my_gtt (id number) on commit preserve rows;
    
    create or replace procedure gtt_truncate(p_cnt number) is
    begin
     for i in 1..p_cnt loop
        execute immediate 'truncate table my_gtt';
        --dbms_lock.sleep(0.01);
      end loop;
    end;
    /
    
    create or replace procedure gtt_truncate_job(p_job_cnt number, p_cnt number) as
      l_job_id pls_integer;
    begin
      for i in 1..p_job_cnt loop
        dbms_job.submit(l_job_id, 'begin while true loop gtt_truncate('||p_cnt||'); end loop; end;');
      end loop;
      commit;
    end;    
    /
    
    create or replace procedure clean_jobs as
    begin
      for c in (select job from dba_jobs) loop
        begin
           dbms_job.remove (c.job);
        exception when others then null;
        end;
        commit;
      end loop;
    
      for c in (select d.job, d.sid, (select serial# from v$session where sid = d.sid) ser 
                  from dba_jobs_running d) loop
        begin
          execute immediate
                 'alter system kill session '''|| c.sid|| ',' || c.ser|| ''' immediate';
          dbms_job.remove (c.job);
        exception when others then null;
        end;
        commit;
      end loop;
      
      -- select * from dba_jobs;
      -- select * from dba_jobs_running;
    end;
    /