Saturday, May 2, 2020

ORA-600 [504] LATCH HIERARCHY ERROR Deadlock

Latches in Oracle are positioned at different levels (16 levels, from 0 to 15, in Oracle 12c and 19c) in order to enforce ordering rules on latch requests. If a request violates these rules, "LATCH HIERARCHY ERROR ORA-600 [504]" is raised and the connection is aborted.

The error is well documented in the MOS document:
    ORA-600 [504] "Trying to obtain a latch which is already held" (Doc ID 28104.1)
    
      This ORA-600 triggers when a latch request violates the ordering 
      rules for obtaining latches and granting the latch would potentially 
      result in a deadlock.
    
      These ordering rules are enforced by assigning each latch a "level#", 
      and checking requested-level vs owned-levels on each latch request. 
     
    ARGUMENTS:
      Arg [a] the requested latch address
      Arg [b] bit array of latches owned
      Arg [c] the level of the requested latch
      Arg [d] the name of the requested latch
      Arg [e] the child latch number of the requested latch
      Arg [f] if non zero, allows 2 wait gets of children
      Arg [g] address of latch owned at this level
For example,
    ORA-00600: internal error code, arguments: [504], 
               [0x18C5C8AD8], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
The above document describes that the latch ordering level is designed to prevent latch deadlocks when multiple latches are requested (a similar mechanism can be found in UNIX kernel code). Once a process acquires a latch of level x, it can only acquire another latch of a level higher than x (levels must be monotonically increasing).
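The ordering rule can be sketched as a small simulation (a conceptual Python model of what ksl_level_check enforces, not Oracle's actual code; the class and method names are invented for illustration):

```python
# Conceptual simulation of Oracle's latch level ordering rule
# (the check that ksl_level_check enforces); NOT actual Oracle code.

class LatchHierarchyError(Exception):
    """Stands in for ORA-600 [504]."""

class Session:
    def __init__(self):
        # levels of latches currently held, in acquisition order
        self.held_levels = []

    def get_latch(self, name, level):
        # A new latch may only be requested at a level strictly higher
        # than every latch already held (monotonically increasing).
        if self.held_levels and level <= max(self.held_levels):
            raise LatchHierarchyError(
                f"requesting '{name}' at level {level} "
                f"while holding a latch at level {max(self.held_levels)}")
        self.held_levels.append(level)

    def free_latch(self, level):
        self.held_levels.remove(level)

s = Session()
s.get_latch('redo copy', 4)          # ok: nothing held yet
s.get_latch('redo allocation', 5)    # ok: 5 > 4
s.free_latch(5)
s.free_latch(4)

s.get_latch('global ctx hash table latch', 5)
try:
    s.get_latch('enqueue hash chains', 4)   # violates ordering: 4 <= 5
except LatchHierarchyError as e:
    print('ORA-600 [504]:', e)
```

The last two gets reproduce the exact conflict demonstrated later in this blog: a level 4 latch requested while a level 5 latch is held.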

The document also indicates that one session can simultaneously hold two or more latches as long as no ordering rule is violated ("ordering" or sorting only applies when there are two or more elements).

For example, the blog Oracle Redo Strand describes two redo latches:
Generate redo entry in PGA 
    -> Server process requests Redo Copy latch (multiple latches, 2*CPU_COUNT by default) 
         -> Server process requests redo allocation latch 
              -> allocate log buffer 
         -> release redo allocation latch 
         -> copy redo entries into log buffer 
  -> release Redo Copy latch
"redo allocation" is requested inside "redo copy" since the former has LEVEL# 5 and the latter LEVEL# 4.
IMU (in-memory undo) buffers are a prerequisite for using a private strand; the 'In memory undo latch' has LEVEL# 0.
Therefore, the request ordering is specified by their LEVEL# as follows:

select name, level# from  v$latch where name in ('redo copy', 'redo allocation', 'In memory undo latch') order by level#; 

  NAME                  LEVEL#
  --------------------  ------
  In memory undo latch      0
  redo copy                 4
  redo allocation           5

select name, level#, count(*) from v$latch_children where name in ('redo copy', 'redo allocation', 'In memory undo latch') 
group by name, level# order by level#; 

  NAME                 LEVEL#  COUNT(*)
  -------------------- ------  --------
  In memory undo latch      0       118
  redo copy                 4        12
  redo allocation           5       120
"Arg [f]" reveals a design consideration to allow a second willing-to-wait get of a child latch (in addition to the current one).

In this blog, we will construct a test case to demonstrate the error and its unexpected impact on real application operations.

Note: Tested in Oracle 12c, 19c


1. Test Setup


First, we create a global application context, and allocate a user lock (insert one row into sys.dbms_lock_allocated) so that two latches:
    'global ctx hash table latch' (Level 5) 
    'enqueue hash chains' (Level 4) 
will be requested in the test.

create or replace context my_ctx using my_pkg accessed globally;
/

create or replace package my_pkg is
 procedure set_ctx(p_name varchar2, p_value varchar2);
end;
/

create or replace package body my_pkg is
 procedure set_ctx(p_name varchar2, p_value varchar2) is
 begin
   dbms_session.set_context('my_ctx', p_name, p_value);
 end set_ctx;
end;
/

declare
  o_lock_hdl   varchar2(128);
begin
  dbms_lock.allocate_unique(lockname => 'MY_ULOCK', lockhandle => o_lock_hdl);
end;
/ 

---- Smoke Tests ----
exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');

select sys_context('my_ctx', 'my_lockname') var1 from dual;

select * from sys.dbms_lock_allocated where name = 'MY_ULOCK';

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';


2. Blocking Test


Open 3 Sqlplus sessions: SID1 (SPID 19944), SID2 (SPID 19946), SID3. The first two are test sessions, the third is a monitor session (tested in Oracle 12c on Solaris).

In SID3, run the code below to display 'global ctx hash table latch' statistics:

alter system flush shared_pool;
alter system flush buffer_cache;

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
  from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ ---------- ----------
  60046158    429      5 6431      8      6          2 1815729481

select * from v$latch_misses 
 where parent_name = 'global ctx hash table latch' 
   and (sleep_count > 0 or wtr_slp_count > 0);

  PARENT_NAME                 WHERE                       NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION                   
  --------------------------- --------------------------- ------------ ----------- ------------- -------------- ---------------------------
  global ctx hash table latch kzctxgscv: allocate context            0           6             0              5 kzctxgscv: allocate context
  global ctx hash table latch kzctxggcv: get context val             0           0             6              0 kzctxggcv: get context val   
For SID1 (SPID 19944), we also open a UNIX terminal and run Dtrace:

sudo dtrace -w -n  \
'
pid$target::kslfre:entry /arg0 == 0x60046158/  {
  printf("\n%s:%s (Addr=>0x%X, Mode=>%d, PID=>0x%X)", probefunc, probename, arg0, arg4, arg5);
  @CNT[ustack(10, 0)] = count();
  stop(); exit(0);
  }
' -p 19944 
Then set the context by:

SID1 (SPID 19944) > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
Immediately we can see that SID1 is suspended (the Dtrace stop() action halts the process at kslfre entry, before the latch is released, so SID1 keeps holding it), and the Dtrace output displays the latch parameters and call stack. It shows the latch was taken in "Mode=>16" (probably exclusive mode).

  dtrace: description '
  pid$target::kslfre:entry ' matched 1 probe
  dtrace: allowing destructive actions
   CPU     ID                    FUNCTION:NAME
     4  81237                     kslfre:entry
  kslfre:entry (Addr=>0x60046158, Mode=>16, PID=>0x2000000000000018)
  
      oracle`kslfre
      oracle`kzctxgset+0x195
      oracle`kzctxgscv+0x5e7
      oracle`kzctxesc+0x789
      oracle`pevm_icd_call_common+0x29d
      oracle`pfrinstr_ICAL+0x90
      oracle`pfrrun_no_tool+0x12a
      oracle`pfrrun+0x4c0
      oracle`plsql_run+0x288
      oracle`peicnt+0x946
Run the same queries in SID3 again to show the latch usage statistics (GETS increased by 1) and the latch holder:

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ ---------- ----------
  60046158    429      5 6432      8      6          2 1815729481

select * from v$latchholder;

  PID  SID LADDR    NAME                        GETS 
  --- ---- -------- --------------------------- ---- 
   24   10 60046158 global ctx hash table latch 6432 
Now in SID2, we run the PL/SQL block below, which accesses the global context to get the lock name and then looks up the user lock id with it.

SID2 (SPID 19946) > 
  declare 
   l_lockname   varchar2(1000);
   l_lockid     number;
  begin
   l_lockname := sys_context('my_ctx', 'my_lockname'); 
   select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
  end;
  / 
It is blocked by SID1 on the 'global ctx hash table latch' request, probably because SID1 holds the latch in exclusive mode ("Mode=>16"). We can run queries in SID3 to display the blocking chain.

select * from v$latch_misses 
 where parent_name = 'global ctx hash table latch' 
   and (sleep_count > 0 or wtr_slp_count > 0);
   
  PARENT_NAME                 WHERE                       NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION                   
  --------------------------- --------------------------- ------------ ----------- ------------- -------------- ---------------------------
  global ctx hash table latch kzctxgscv: allocate context            0           7             0              5 kzctxgscv: allocate context
  global ctx hash table latch kzctxggcv: get context val             0           0             7              0 kzctxggcv: get context val 

select (select s.program from v$session s where sid=w.sid) program
      ,chain_signature, osid, pid, sid
      ,blocker_is_valid bvalid, blocker_sid, p1, p1_text, p2, p3, in_wait_secs, num_waiters
  from v$wait_chains w
 order by in_wait_secs desc nulls first;
 
  PROGRAM CHAIN_SIGNATURE OSID  PID SID BVALI BLOCKER_SID         P1 P1_TEXT   P2 P3 IN_WAIT_SECS NUM_WAITERS
  ------- --------------- ----- --- --- ----- ----------- ---------- -------- --- -- ------------ -----------
  sqlplus <='latch free'  19944  24  10 FALSE                                                               1
  sqlplus <='latch free'  19946  23 903 TRUE           10 1610899800 address  429  0           46           0
Now we kill the SID1 process by:

kill -9 19944 
Immediately SID2 throws an ORA-00600 error:

SID2 (SPID 19946) > declare
    2  l_lockname   varchar2(1000);
    3  l_lockid     number;
    4  begin
    5  l_lockname := sys_context('my_ctx', 'my_lockname');
    6  select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
    7  end;
    8  /
  declare
  *
  ERROR at line 1:
  ORA-00600: internal error code, arguments: [504], 
    [0x18C5C8AD8], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
Looking at the generated incident file, the call stack shows that the error was raised in subroutine "ksl_level_check" ("ERROR SIGNALED: yes"). It signifies that we already held the Level 5 'global ctx hash table latch' but then requested the Level 4 'enqueue hash chains' latch; the requested level is not higher than the held level, hence the ordering rule is violated.

  ========= Dump for incident 274833 (ORA 600 [504]) ========
  [TOC00003]
  ----- Beginning of Customized Incident Dump(s) -----
  *** LATCH HIERARCHY ERROR ***
  An attempt to get the 'enqueue hash chains' latch (child #6) at level 4 (address = 0x18c5c8ad8)
  from location 'ksq.h LINE:2656 ID:ksqgtl3' (also see call stack below)
  conflicts with the following held latch(es):
   Level 5: 'global ctx hash table latch' (address = 0x60046158)
    gotten from location 'kzctxg.h LINE:595 ID:kzctxggcv: get context val'
  
  --------------------- Binary Stack Dump ---------------------
  [1]  (ksedst()+307 -> skdstdst()) 
  [2]  (dbkedDefDump()+1121 -> ksedst()) 
  [3]  (ksedmp()+304 -> dbkedDefDump()) 
  [4]  (dbgexPhaseII()+2387 -> ksedmp()) 
  [5]  (dbgexExplicitEndInc()+1325 -> dbgexPhaseII()) 
  [6]  (dbgeEndDDEInvocationImpl()+875 -> dbgexExplicitEndInc()) 
  [7]  (ksl_level_check()+2673 -> dbgeEndDDEInvocation()) 
          CALL TYPE: call   ERROR SIGNALED: yes   COMPONENT: VOS
  [8]  (kslgetl()+409 -> ksl_level_check()) 
  [9]  (ksqgtlctx()+1324 -> kslgetl()) 
  [10] (ksqgelctx()+615 -> ksqgtlctx()) 
  [11] (kksfbc()+5913 -> ksqgelctx()) 
  [12] (kkspbd0()+801 -> kksfbc()) 
  [13] (kksParseCursor()+452 -> kkspbd0()) 
  [14] (opiosq0()+1802 -> kksParseCursor()) 
  [15] (opipls()+2568 -> opiosq0()) 
In SID3, run the query again; latch GETS, MISSES and SLEEPS all increased by 1.

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ --------- ----------
  60046158    429      5 6433      9      7         2 1961502719
The incident file shows that we were trying to get the 'enqueue hash chains' latch (child #6) at level 4 (address = 0x18c5c8ad8). Subroutines "kksParseCursor" and "kksfbc" (Kernel Kompile Shared Find Bound Cursor) in the call stack indicate that this request comes from SQL parsing. To look up a shared resource stored in an array of hash buckets, the search text is hashed to an array index, the latch protecting that bucket is acquired, and the hash chain is then traversed to reach the shared resource.
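The lookup sequence can be sketched as follows (a simplified Python model; one lock per bucket mirrors the child-latch granularity of 'enqueue hash chains', but the structures and the lockid value are invented for illustration):

```python
import threading

# Simplified model of a latch-protected hash table with chains, in the
# spirit of the enqueue hash chains: one latch (here a Lock) per bucket,
# chosen by hashing the search key. NOT Oracle's actual structures.

N_BUCKETS = 6
buckets = [[] for _ in range(N_BUCKETS)]                 # each bucket is a chain
latches = [threading.Lock() for _ in range(N_BUCKETS)]   # one child latch per bucket

def bucket_of(key):
    return hash(key) % N_BUCKETS          # 1. hash the key to a bucket index

def insert(key, value):
    i = bucket_of(key)
    with latches[i]:                      # acquire the child latch for this chain
        buckets[i].append((key, value))

def lookup(key):
    i = bucket_of(key)
    with latches[i]:                      # 2. acquire the latch protecting the chain
        for k, v in buckets[i]:           # 3. traverse the chain to the resource
            if k == key:
                return v
    return None

insert('MY_ULOCK', 1073742062)            # lockid value is invented
print(lookup('MY_ULOCK'))                 # -> 1073742062
```

The point is that the latch covers only one chain, which is why 'enqueue hash chains' is implemented as many child latches rather than one single latch.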

With the following query, we can also list this child latch.

select name, addr, latch#, child#, level#, gets, misses, sleeps, spin_gets
from v$latch_children where addr like '%18C5C8AD8';

  NAME                ADDR      LATCH# CHILD# LEVEL#     GETS MISSES SLEEPS  SPIN_GETS
  ------------------- --------- ------ ------ ------ -------- ------ ------ ----------
  enqueue hash chains 18C5C8AD8     36      6      4 74790437     81      0         81
Note that 'global ctx hash table latch' is a single latch (without children), but 'enqueue hash chains' has a few children (6 in test DB).

In the above test, we manually killed the SID1 process. In real applications, this can happen when a session is abnormally terminated (for example, when memory is under pressure, the OS can pick a process to kill).

The test also shows that when SID1 is killed, the unaware SID2 is abruptly disconnected: an unexpected impact on real application operations.


3. Non-Blocking Test


Instead of setting a global context value as in the above test, we will look at what happens if we only get a global context value.

In SID3, run the code below:

alter system flush shared_pool;
alter system flush buffer_cache;

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ ---------- ----------
  60046158    429      5 6721     10      7          3 1961502719
Again run the same Dtrace for SID1 (SPID 20024):

sudo dtrace -w -n  \
'
pid$target::kslfre:entry /arg0 == 0x60046158/  {
  printf("\n%s:%s (Addr=>0x%X, Mode=>%d, PID=>0x%X)", probefunc, probename, arg0, arg4, arg5);
  @CNT[ustack(10, 0)] = count();
  stop(); exit(0);
  }
' -p 20024
In SID1, get global context value:

SID1 (SPID 20024) > 
  declare 
   l_val        varchar2(1000);
  begin
   l_val := sys_context('my_ctx', 'var1'); 
  end;
  / 
Dtrace shows the call coming from the getter kzctxGblCtxGet (in the above Blocking Test, it was the setter kzctxgset):

  dtrace: description '
  pid$target::kslfre:entry ' matched 1 probe
  dtrace: allowing destructive actions
   CPU     ID                    FUNCTION:NAME
     2  81237                     kslfre:entry
  kslfre:entry (Addr=>0x60046158, Mode=>1, PID=>0x4B)
  
      oracle`kslfre
      oracle`kzctxGblCtxGet+0x841
      oracle`kzctxAppCtxValGet3+0x481
      oracle`psdsysctx+0x1eb
      oracle`pessysctx2+0xe3
      oracle`pevm_icd_call_common+0x12b
      oracle`pfrinstr_BCAL+0x4e
      oracle`pfrrun_no_tool+0x12a
      oracle`pfrrun+0x4c0
      oracle`plsql_run+0x288
The query of latch statistics in SID3 shows GETS increased by 1, and the single latch 'global ctx hash table latch' is held by SID1.

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ ---------- ----------
  60046158    429      5 6722     10      7          3 1961502719

select * from v$latchholder;

   PID SID LADDR    NAME                        GETS
  ---- --- -------- --------------------------- ---- 
    24  10 60046158 global ctx hash table latch 6722 
Repeat the same SID2 (SPID 20026) code as before:

SID2 (SPID 20026) > 
  declare 
   l_lockname   varchar2(1000);
   l_lockid     number;
  begin
   l_lockname := sys_context('my_ctx', 'my_lockname'); 
   select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
  end;
  / 

  PL/SQL procedure successfully completed.
SID2 is not blocked by SID1. The above Dtrace output shows that the SID1 latch get is in "Mode=>1" (probably shared mode).
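The difference between the two tests can be sketched as a mode compatibility check (a conceptual model; the mode values 16 and 1 come from the Dtrace output above, but reading them as exclusive/shared is our interpretation, not a documented Oracle constant):

```python
# Conceptual latch mode compatibility: shared gets are compatible with
# each other, while an exclusive get is compatible with nothing.
# Mode values follow the Dtrace output; 16 = exclusive and 1 = shared
# is an interpretation, NOT a documented Oracle constant.

EXCLUSIVE = 16
SHARED = 1

def compatible(held_mode, requested_mode):
    """A request proceeds immediately only if holder and requester are both shared."""
    return held_mode == SHARED and requested_mode == SHARED

# Blocking Test: SID1 held the latch in Mode=>16, SID2's get had to wait
print(compatible(EXCLUSIVE, SHARED))   # False: SID2 waits on 'latch free'

# Non-Blocking Test: SID1 held the latch in Mode=>1, SID2's get succeeds
print(compatible(SHARED, SHARED))      # True: SID2 runs through
```

This matches the statistics: in the Non-Blocking Test GETS increases while MISSES and SLEEPS stay flat, because the shared get never had to wait.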

The query of latch statistics in SID3 shows GETS increased by 1, but not MISSES and SLEEPS, even though the single latch 'global ctx hash table latch' is still held by SID1.

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL# GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ---- ------ ------ ---------- ----------
  60046158    429      5 6723     10      7          3 1961502719


4. Direct Subroutine Call Test


The blog Oracle 12 and latches, part 3 (Frits Hoogland Weblog) shows that the latch get subroutine "ksl_get_shared_latch" can be called directly. We will try it with the following test (in Oracle 19c on Linux).

In SID3, run the query below to get the 'global ctx hash table latch' address and stats:

select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time 
from v$latch where name = 'global ctx hash table latch';

  ADDR     LATCH# LEVEL#  GETS MISSES SLEEPS  SPIN_GETS  WAIT_TIME
  -------- ------ ------ ----- ------ ------ ---------- ----------
  60077140    589      5 10879     10      2          8         57
In SID1, call "ksl_get_shared_latch" to acquire 'global ctx hash table latch':

SID1 > oradebug setmypid
  Statement processed.
  
SID1 > oradebug call ksl_get_shared_latch 0x60077140 1 0 4618 16
  Function returned 1
In SID2, set the global context (note that in the previous two tests, we set the global context in SID1):

SID2 > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
SID2 is blocked by SID1 with Wait Event: 'latch free'.

Now back in SID1, run a query (which requests the 'enqueue hash chains' latch). Immediately SID1 throws the same ORA-00600 [504].

SID1 > select lockid from sys.dbms_lock_allocated where name = 'MY_ULOCK';
  select lockid from sys.dbms_lock_allocated where name = 'MY_ULOCK'
  *
  ERROR at line 1:
  ORA-00600: internal error code, arguments: [504], [0x0B41C6288], [32], [4],
  [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
  
and the incident file contains the details:

  ORA-00600: internal error code, arguments: [504], 
    [0x0B41C6288], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
  
  ========= Dump for incident 45273 (ORA 600 [504]) ========
  [TOC00003]
  ----- Beginning of Customized Incident Dump(s) -----
  *** LATCH HIERARCHY ERROR ***
  An attempt to get the 'enqueue hash chains' latch (child #6) at level 4 (address = 0xb41c6288)
  from location 'ksq.h LINE:2721 ID:ksqgtl3' (also see call stack below)
  conflicts with the following held latch(es):
   Level 5: 'global ctx hash table latch' (address = 0x60077140)
    gotten from location 'krvx.h LINE:18027 ID:krvxpaf: activate'
  
  --------------------- Binary Stack Dump ---------------------
  [1]  (ksedst1()+95 -> kgdsdst())
  [2]  (ksedst()+58 -> ksedst1())
  [3]  (dbkedDefDump()+23080 -> ksedst())
  [4]  (ksedmp()+577 -> dbkedDefDump())
  [5]  (dbgexPhaseII()+2092 -> ksedmp())
  [6]  (dbgexExplicitEndInc()+285 -> dbgexPhaseII())
  [7]  (dbgeEndDDEInvocationImpl()+314 -> dbgexExplicitEndInc())
  [8]  (ksl_level_check()+2746 -> dbgeEndDDEInvocationImpl())
          CALL TYPE: call   ERROR SIGNALED: yes   COMPONENT: VOS
  [9]  (kslgetl()+2800 -> ksl_level_check())
  [10] (ksqgtlctx()+1873 -> kslgetl())
  [11] (ksqgelctx()+838 -> ksqgtlctx())
  [12] (kksfbc()+17684 -> ksqgelctx())
  [13] (kkspsc0()+1566 -> kksfbc())
  [14] (kksParseCursor()+114 -> kkspsc0())
  [15] (opiosq0()+2310 -> kksParseCursor())
In SID1, we first held the Level 5 'global ctx hash table latch' and subsequently requested the Level 4 'enqueue hash chains' latch (child #6) in order to parse (kksParseCursor) the submitted SQL; the latch ordering rule is violated, hence ORA-00600 [504] LATCH HIERARCHY ERROR.

If we look at SID2 again, it now runs through.

SID2 > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
  PL/SQL procedure successfully completed.
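To close, recall why the hierarchy exists at all: it makes a circular wait impossible. This can be sketched with a wait-for graph (a conceptual Python model; the cycle check is ours, not an Oracle mechanism):

```python
# Why the level hierarchy prevents deadlock: a deadlock is a cycle in the
# wait-for graph (each session holds a latch and waits for the next one's).
# If every session must request latches in strictly ascending level order,
# following any would-be cycle gives strictly increasing levels, so the
# cycle can never close. Conceptual model only, NOT Oracle code.

def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {waiter: blocker}."""
    for start in wait_for:
        seen, node = set(), start
        while node in wait_for:
            if node in seen:
                return True
            seen.add(node)
            node = wait_for[node]
    return False

# Without ordering: SID1 holds latch A and waits for B; SID2 holds B and
# waits for A -> circular wait, i.e. deadlock.
unordered = {'SID1': 'SID2', 'SID2': 'SID1'}
print(has_cycle(unordered))    # True: deadlock

# With ordering: both sessions acquire the latches in the same ascending
# level order, so only one direction of waiting can ever exist.
ordered = {'SID2': 'SID1'}     # SID2 waits for SID1; SID1 waits for nobody
print(has_cycle(ordered))      # False: no deadlock
```

Raising ORA-600 [504] and aborting the offending session is Oracle's way of refusing to let the first edge of such a cycle form.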