On Oracle: Oracle Latch Misses, Spin_gets, Sleeps and Recurrent

After the correction of Blog: Is latch misses statistic gathered or deduced ? I re-examined a few hourly AWR reports in one heavily loaded system (Oracle 12.1.0.2.0).

1. Latch Statistics

We observed two distinctive Sets of Latch Statistics in AWR reports.

Here AWR Top Events Section and "Latch Sleep Breakdown" Section in both Cases.

1.1. Case-1: Recurrent Spin_gets

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Avg wait (ms)	% DB time	Wait Class
latch: cache buffers chains	1,078,528	291.7K	270.50	37.6	Concurrency
latch: row cache objects	423,293	78.2K	184.86	10.1	Concurrency

Latch Sleep Breakdown

ordered by misses desc

Latch Name	Get Requests	Misses	Sleeps	Spin Gets
cache buffers chains	15,134,617,472	176,939,147	1,040,507	314,665,205
row cache objects	101,225,008	4,075,626	429,314	3,673,739
Result Cache: RC Latch	7,876,403	44,622	25,380	19,440

Look "cache buffers chains" (CBC) statistics:

     Misses    = 176,939,147
     Sleeps    =   1,040,507
     Spin Gets = 314,665,205

Spin_Gets(314,665,205) > Misses(176,939,147) attested the existence of recurrent Spin_Gets.

1.2. Case-2: Recurrent Sleeps

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Avg wait (ms)	% DB time	Wait Class
latch: cache buffers chains	1,630,692	748.3K	458.90	92.7	Concurrency
latch: row cache objects	314,199	23.9K	76.14	3.0	Concurrency

Latch Sleep Breakdown

ordered by misses desc

Latch Name	Get Requests	Misses	Sleeps	Spin Gets
row cache objects	55,852,812	3,505,131	320,033	3,224,234
cache buffers chains	1,722,527,053	2,564,938	2,819,003	1,026,077
kokc descriptor allocation latch	5,984,155	343,396	17,985	326,266

Look "cache buffers chains" (CBC) statistics:

     Misses    = 2,564,938
     Sleeps    = 2,819,003
     Spin Gets = 1,026,077

Sleeps(2,819,003) > Misses(2,564,938) evidenced the existence of recurrent Sleeps.

2. Recurrent_Misses

Every unsuccessful Spin_Get will trigger a Sleep (or each Sleep follows a Spin_Get), every unsuccessful Sleep will result in a successive longer Sleep.

Latch Sleep is started with an initial duration of 10ms (or 1ms), and progressively elevated after every unsuccessful Sleep (See later discussion of V$LATCH.sleep[1-11], V$EVENT_HISTOGRAM.wait_time_milli, and pseudo-code in Book: Oracle Performance Firefighting Page 79).

Suppose initial duration being 10ms, Avg wait (ms) = 270.50 (Case-1) indicates that there exist Sleeps, which are not initial Sleeps, hence recurrent Sleeps.

For OS scheduler Time Slice of 10ms, Avg wait (ms) = 270.50 signifies that majority of Sleeps spans more than 20 Ticks, probably recurrent Sleeps (also possible not awakened from waitqueue, i.e., not scheduled into runqueue).

Recurrent_Misses can be computed by:

Case-1:  
    spin_gets + sleeps - misses = 314,665,205 + 1,040,507 - 176,939,147 = 138,766,565 > 0

Case-2:
    spin_gets + sleeps - misses = 1,026,077 + 2,819,003 - 2,564,938 = 1,280,142 > 0

Therefore, above formula could be refined as:

       sleeps + spin_gets – misses 
     = recurrent_misses 
     = recurrent_sleeps + recurrent_spin_gets

Look proportion of recurrent Misses:

    Case-1: recurrent_misses / Misses amounts to 78.4% (138,766,565 / 176,939,147)
    Case-2: recurrent_misses / Misses amounts to 49.9% (1,280,142 / 2,564,938)

both are serious cases of latch misses.

3. Observations

(1). Latch Sleep Breakdown Table shows:
         Case-1: "cache buffers chains": Spin_Gets(314,665,205) > Misses(176,939,147)
         Case-2: "cache buffers chains": Sleeps(2,819,003)           > Misses(2,564,938)
   
   Hence there exist Spin_Get and Sleep which are not followed by one success Get, that is, recurrent Miss.
   Or we can say that each Miss can trigger multi Spin_Gets or Sleeps.
   
(2). Sleeps can be bigger than Spin_Gets, for example, 
      Case-1: "Result Cache: RC Latch": Sleeps(25,380) > Spin_Gets(19,440)
   
(3). Usually when one Latch has very slight contention, 
       misses = sleeps + spin_gets
   It means that each Miss is resolved either by one Spin_Get, or by one Sleep.
   This also indicates that immediately after each Sleep (OS Process waking up and rescheduled), Oracle makes one Latch Get.
   
(4). Top 10 Foreground Events Table shows:
      Latch Sleeps is approximate to Event Waits, i.e V$LATCH.sleeps = V$SYSTEM_EVENT.total_waits.
   It signifies that only Latch Sleeps is counted as Waits because Process is on the wait queue. 
   Latch Spin_Gets is not counted because Process is on CPU (runqueue).

By the above observations, it is not clear how Oracle makes the counting of Misses, Sleeps and Spin_Gets in willing-to-wait mode. And it is hard to reconstruct a pseudo-code to match above countings.

4. Discussions

(1). Oracle session's Response Time is made of Service Time and Queue Time. Spin_Gets is counted as Service Time since it is on CPU.
   Whereas Sleeps is categorized as Queue Time since it is on waiting. For details, see Book: 
   Oracle Performance Firefighting Page 74-91.

(2). As discussed, generally, Latch Spins is burning CPU; whereas Latch Sleeps yields CPU. Therefore when talking about Latch contentions, 
   it would be necessary to distinguish between Spin_gets and Sleeps. 
   As tested, usually Spin_gets of Latch Misses are caused by frequently concurrent access; whereas Sleeps of Latch Misses are triggered 
   by Invalidations or DML Modifications. 
   
(3). Oracle V$LATCH SLEEP evaluated, which leaks the algorithm changing.
 
   In Oracle 9i, V$LATCH is documented as:
     SLEEP[1|2|3]           Waits that slept 1 time through 3 times, respectively
     SLEEP4                   Waits that slept 4 or more times
     SLEEP[5|6|7|8|9|10|11] present for compatibility with previous releases of Oracle. No data is accumulated.
     
   Since Oracle 10g, V$LATCH is updated as:
     SLEEP[1|2|3|4|5|6|7|8|9|10|11] deprecated. As a substitute, query V$EVENT_HISTOGRAM for EVENT like latch free or latch:%.                                 
   
   So the original design elevated SLEEP in 11 grades, with 9i, it was reduced to 4, and 10g regulated it in a new approach. 
   
(4). Suppose OS scheduler Time Slice is 10ms (system clock tick frequency of 100 hertz), 
   Wait Event "Avg wait (ms)"/10 represents the paused ticks before placed back into the runqueue.
   
(5). There are two sorts of Latches, one has children (see V$LATCH_CHILDREN), for example, "cache buffers chains";
   other one has no children and hence an instance_wide single Latch, for example, "Result Cache: RC Latch".
   Therefore, when comparing Sleeps and Spin_Gets among Latches, the number of children should be taken into account.
   For example, single Latch can serialize whole system.

With following query, we can see number of Blocks protected per CBC Child Latch:


with block_size as (select value b_val from v$parameter where name in ('db_block_size'))
    ,cache_size as (select sum(value) c_val from v$parameter where name in ('db_cache_size', 'db_keep_cache_size', 'db_recycle_cache_size'))
    ,latch_cnt  as (select count(*) cnt from v$latch_children where name in ('cache buffers chains'))
select b_val, round(c_val/1024/1024) cache_size_mb, l.cnt, ceil(c.c_val/b_val/cnt) block_per_latch
from block_size b, cache_size c, latch_cnt l;

5. Queries

-- v$latch
select name, misses, sleeps, spin_gets
      ,sign((sleeps + spin_gets) - misses) miss_deduced_sign
      ,((sleeps + spin_gets) - misses)     recurrent_misses
      ,trunc((abs((sleeps + spin_gets) - misses) / nullif(misses, 0)) * 100 ,2)*100 recurrent_misses_perc
from v$latch where misses > 1000
order by misses desc;

-- v$system_event
select event, total_waits, round(time_waited_micro/1e6, 2) time_waited_sec, average_wait*1e4 average_wait_micro, e.*
from v$system_event  e
where event in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
order by e.event;

-- v$session_event
select sid, event, total_waits, round(time_waited_micro/1e6, 2) time_waited_sec, average_wait*1e4 average_wait_micro, e.*
from v$session_event e
where e.event in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
 --and sid in (217, 1217)
order by e.event, e.sid;

-- dba_hist_latch (One can also query dba_hist_latch_parent, dba_hist_latch_children)
select snap_id End_Snap, end_interval_time Snap_Time, latch_name Latch_Name
      ,gets_d       Get_request
      ,misses_d     Misses
      ,sleeps_d     Sleeps
      ,spin_gets_d  Spin_Gets
      --,v.*
from (
select gets-lag(gets) over(partition by latch_name order by e.snap_id) gets_d
      ,misses-lag(misses) over(partition by latch_name order by e.snap_id) misses_d
      ,sleeps-lag(sleeps) over(partition by latch_name order by e.snap_id) sleeps_d
      ,spin_gets-lag(spin_gets) over(partition by latch_name order by e.snap_id) spin_gets_d
      ,h.end_interval_time, e.* 
from dba_hist_latch e, dba_hist_snapshot h
where e.snap_id = h.snap_id 
  and e.latch_name in ('Result Cache: RC Latch', 'cache buffers chains', 'row cache objects')
  order by e.snap_id desc
) v
--where snap_id = 1217
order by snap_id desc, latch_name, gets_d desc;

-- dba_hist_system_event. Similar to AWR "Top 10 Foreground Events by Total Wait Time"
select snap_id End_Snap, end_interval_time Snap_Time,   event_name Event
      ,total_waits_d                                               Waits
      ,round(time_waited_micro_d/1e6,2)                            Total_Wait_Time_sec
      ,round(time_waited_micro_d/1e3/(nullif(total_waits_d, 0)),2) Wait_Avg_ms
      --,v.*
from (
select total_waits-lag(total_waits) over(partition by event_name order by e.snap_id) total_waits_d
      ,total_timeouts-lag(total_timeouts) over(partition by event_name order by e.snap_id) total_timeouts_d
      ,time_waited_micro-lag(time_waited_micro) over(partition by event_name order by e.snap_id) time_waited_micro_d
      ,h.end_interval_time, e.* 
from dba_hist_system_event e, dba_hist_snapshot h
where e.snap_id = h.snap_id 
  and e.event_name in ('latch free', 'latch: shared pool', 'latch: row cache objects', 'latch: cache buffers chains')
  order by e.snap_id desc
) v
where snap_id = 1217;
order by snap_id desc, total_waits_d desc;

On Oracle

Monday, January 8, 2018

Oracle Latch Misses, Spin_gets, Sleeps and Recurrent_Misses

1. Latch Statistics

1.1. Case-1: Recurrent Spin_gets

Top 10 Foreground Events by Total Wait Time

Latch Sleep Breakdown

1.2. Case-2: Recurrent Sleeps

Top 10 Foreground Events by Total Wait Time

Latch Sleep Breakdown

2. Recurrent_Misses

3. Observations

4. Discussions

5. Queries