The error is well documented in MOS:
ORA-600 [504] "Trying to obtain a latch which is already held" (Doc ID 28104.1)
This ORA-600 triggers when a latch request violates the ordering rules for obtaining latches and granting the latch would potentially result in a deadlock. These ordering rules are enforced by assigning each latch a "level#", and checking requested-level vs owned-levels on each latch request.
ARGUMENTS:
Arg [a] the requested latch address
Arg [b] bit array of latches owned
Arg [c] the level of the requested latch
Arg [d] the name of the requested latch
Arg [e] the child latch number of the requested latch
Arg [f] if non zero, allows 2 wait gets of children
Arg [g] address of latch owned at this level
For example,
ORA-00600: internal error code, arguments: [504], [0x18C5C8AD8], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
The above note describes the latch ordering levels, which are designed to prevent latch deadlocks when multiple latches are requested (a similar mechanism can be found in UNIX kernel code). Once a process has acquired a latch of level x, it can only acquire further latches of a level higher than x (monotonically increasing).
The note also indicates that one session can simultaneously hold two or more latches as long as no ordering rule is violated ("ordering", or sorting, only applies when there are two or more elements).
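Since each latch's level# is exposed in v$latch, the whole hierarchy can be inspected directly. A quick sketch (the level range and latch names vary by version):
select level#, count(*) latch_count, min(name) example_name
from v$latch
group by level#
order by level#;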
For example, the blog Oracle Redo Strand describes two redo latches:
Generate Redo Entry in PGA
-> Server process requests Redo Copy latch (which is multiple, 2*CPU_COUNT by default)
-> Server process requests redo allocation latch
-> allocate log buffer
-> release redo allocation latch
-> copy Redo Entries into Log Buffer
-> release Redo Copy latch
"redo allocation" is requested inside "redo copy" since the former is LEVEL# 5 and the latter is LEVEL# 4,
whereas the 'In memory undo latch' protecting IMU (in-memory undo buffers, a prerequisite for using a private strand) has LEVEL# 0.
Therefore, the request ordering is specified by their LEVEL# as follows:
select name, level# from v$latch
where name in ('redo copy', 'redo allocation', 'In memory undo latch')
order by level#;
NAME LEVEL#
-------------------- ------
In memory undo latch 0
redo copy 4
redo allocation 5
select name, level#, count(*) from v$latch_children where name in ('redo copy', 'redo allocation', 'In memory undo latch')
group by name, level# order by level#;
NAME LEVEL# COUNT(*)
-------------------- ------ --------
In memory undo latch 0 118
redo copy 4 12
redo allocation 5 120
"Arg [f]" reveals the design consideration to "enqueue" the second willing-to-wait mode Get
(in addition to the current one).In this Blog, we will create a use case to demonstrate the error and the unexpectedly impact on real application operations.
Note: Tested in Oracle 12c, 19c
1. Test Setup
First, we create a global application context, and allocate a user lock (inserting one row into sys.dbms_lock_allocated) so that two latches:
'global ctx hash table latch' (Level 5)
'enqueue hash chains' (Level 4)
will be requested in the test (the levels are confirmed by the query at the end of this setup).
create or replace context my_ctx using my_pkg accessed globally;
/
create or replace package my_pkg is
procedure set_ctx(p_name varchar2, p_value varchar2);
end;
/
create or replace package body my_pkg is
procedure set_ctx(p_name varchar2, p_value varchar2) is
begin
dbms_session.set_context('my_ctx', p_name, p_value);
end set_ctx;
end;
/
declare
o_lock_hdl varchar2(128);
begin
dbms_lock.allocate_unique(lockname => 'MY_ULOCK', lockhandle => o_lock_hdl);
end;
/
---- Smoke Tests ----
exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
select sys_context('my_ctx', 'my_lockname') var1 from dual;
select * from sys.dbms_lock_allocated where name = 'MY_ULOCK';
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
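We can also confirm the claimed levels of the two latches involved (a small verification query against v$latch):
select name, level# from v$latch
where name in ('global ctx hash table latch', 'enqueue hash chains')
order by level#;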
2. Blocking Test
Open 3 Sqlplus sessions: SID1 (SPID 19944), SID2 (SPID 19946), SID3. The first two are test sessions, the third is a monitor session (tested in Oracle 12c on Solaris).
In SID3, run the statements below to display 'global ctx hash table latch' statistics:
alter system flush shared_pool;
alter system flush buffer_cache;
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ ---------- ----------
60046158 429 5 6431 8 6 2 1815729481
select * from v$latch_misses
where parent_name = 'global ctx hash table latch'
and (sleep_count > 0 or wtr_slp_count > 0);
PARENT_NAME WHERE NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION
--------------------------- --------------------------- ------------ ----------- ------------- -------------- ---------------------------
global ctx hash table latch kzctxgscv: allocate context 0 6 0 5 kzctxgscv: allocate context
global ctx hash table latch kzctxggcv: get context val 0 0 6 0 kzctxggcv: get context val
For SID1 (SPID 19944), we also open a UNIX terminal and run Dtrace:
sudo dtrace -w -n \
'
/* fire when the process is about to free the latch at address 0x60046158
   (kslfre:entry, arg0 = latch address) */
pid$target::kslfre:entry /arg0 == 0x60046158/ {
printf("\n%s:%s (Addr=>0x%X, Mode=>%d, PID=>0x%X)", probefunc, probename, arg0, arg4, arg5);
@CNT[ustack(10, 0)] = count();
/* stop() suspends the traced process before the latch is freed, so the
   session keeps holding the latch; exit(0) then ends the Dtrace session */
stop(); exit(0);
}
' -p 19944
Then set context by:
SID1 (SPID 19944) > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
Immediately we can see that SID1 is suspended (stop() halts the process at the entry of kslfre, before the latch is released, so SID1 continues to hold it), and the Dtrace output displays the latch parameters and call stack.
It shows the latch in "Mode=>16" (probably exclusive mode).
dtrace: description '
pid$target::kslfre:entry ' matched 1 probe
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
4 81237 kslfre:entry
kslfre:entry (Addr=>0x60046158, Mode=>16, PID=>0x2000000000000018)
oracle`kslfre
oracle`kzctxgset+0x195
oracle`kzctxgscv+0x5e7
oracle`kzctxesc+0x789
oracle`pevm_icd_call_common+0x29d
oracle`pfrinstr_ICAL+0x90
oracle`pfrrun_no_tool+0x12a
oracle`pfrrun+0x4c0
oracle`plsql_run+0x288
oracle`peicnt+0x946
Run the same queries in SID3 again to show the latch usage statistics (GETS increased by 1) and the latch holder:
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ ---------- ----------
60046158 429 5 6432 8 6 2 1815729481
select * from v$latchholder;
PID SID LADDR NAME GETS
--- ---- -------- --------------------------- ----
24 10 60046158 global ctx hash table latch 6432
Now in SID2, we run the PL/SQL block below, which reads the global context to obtain the lock name,
and then looks up the user lock with it.
SID2 (SPID 19946) >
declare
l_lockname varchar2(1000);
l_lockid number;
begin
l_lockname := sys_context('my_ctx', 'my_lockname');
select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
end;
/
It is blocked by SID1 on the 'global ctx hash table latch' request,
probably because SID1 holds the latch in exclusive mode ("Mode=>16").
We can run queries in SID3 to display the blocking chain.
select * from v$latch_misses
where parent_name = 'global ctx hash table latch'
and (sleep_count > 0 or wtr_slp_count > 0);
PARENT_NAME WHERE NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION
--------------------------- --------------------------- ------------ ----------- ------------- -------------- ---------------------------
global ctx hash table latch kzctxgscv: allocate context 0 7 0 5 kzctxgscv: allocate context
global ctx hash table latch kzctxggcv: get context val 0 0 7 0 kzctxggcv: get context val
select (select s.program from v$session s where sid=w.sid) program
,chain_signature, osid, pid, sid
,blocker_is_valid bvalid, blocker_sid, p1, p1_text, p2, p3, in_wait_secs, num_waiters
from v$wait_chains w
order by in_wait_secs desc nulls first;
PROGRAM CHAIN_SIGNATURE OSID PID SID BVALI BLOCKER_SID P1 P1_TEXT P2 P3 IN_WAIT_SECS NUM_WAITERS
------- --------------- ----- --- --- ----- ----------- ---------- -------- --- -- ------------ -----------
sqlplus <='latch free' 19944 24 10 FALSE 1
sqlplus <='latch free' 19946 23 903 TRUE 10 1610899800 address 429 0 46 0
Now we kill the SID1 process:
kill -9 19944
Immediately SID2 throws an ORA-00600 error:
SID2 (SPID 19946) > declare
2 l_lockname varchar2(1000);
3 l_lockid number;
4 begin
5 l_lockname := sys_context('my_ctx', 'my_lockname');
6 select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
7 end;
8 /
declare
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [504],
[0x18C5C8AD8], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
Looking at the generated incident file, the call stack shows that the error was raised in subroutine
"ksl_level_check" ("ERROR SIGNALED: yes").
It signifies that we already hold the Level 5 'global ctx hash table latch',
but still request the Level 4 'enqueue hash chains' latch, hence the ordering rule is violated (requested level 4 < held level 5).
========= Dump for incident 274833 (ORA 600 [504]) ========
[TOC00003]
----- Beginning of Customized Incident Dump(s) -----
*** LATCH HIERARCHY ERROR ***
An attempt to get the 'enqueue hash chains' latch (child #6) at level 4 (address = 0x18c5c8ad8)
from location 'ksq.h LINE:2656 ID:ksqgtl3' (also see call stack below)
conflicts with the following held latch(es):
Level 5: 'global ctx hash table latch' (address = 0x60046158)
gotten from location 'kzctxg.h LINE:595 ID:kzctxggcv: get context val'
--------------------- Binary Stack Dump ---------------------
[1] (ksedst()+307 -> skdstdst())
[2] (dbkedDefDump()+1121 -> ksedst())
[3] (ksedmp()+304 -> dbkedDefDump())
[4] (dbgexPhaseII()+2387 -> ksedmp())
[5] (dbgexExplicitEndInc()+1325 -> dbgexPhaseII())
[6] (dbgeEndDDEInvocationImpl()+875 -> dbgexExplicitEndInc())
[7] (ksl_level_check()+2673 -> dbgeEndDDEInvocation())
CALL TYPE: call ERROR SIGNALED: yes COMPONENT: VOS
[8] (kslgetl()+409 -> ksl_level_check())
[9] (ksqgtlctx()+1324 -> kslgetl())
[10] (ksqgelctx()+615 -> ksqgtlctx())
[11] (kksfbc()+5913 -> ksqgelctx())
[12] (kkspbd0()+801 -> kksfbc())
[13] (kksParseCursor()+452 -> kkspbd0())
[14] (opiosq0()+1802 -> kksParseCursor())
[15] (opipls()+2568 -> opiosq0())
In SID3, run the query again; latch GETS, MISSES and SLEEPS all increased by 1.
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ --------- ----------
60046158 429 5 6433 9 7 2 1961502719
The incident file shows that we were trying to get the
'enqueue hash chains' latch (child #6) at level 4 (address = 0x18c5c8ad8).
Subroutines "kksParseCursor" and "kksfbc" (Kernel Kompile Shared Find Bound Cursor)
in the call stack indicate that this request comes from SQL parsing.
To look up a shared resource stored in an array of hash buckets,
the search text has to be hashed to an array index, the covering latch acquired, and the hash chain
then traversed to reach the shared resource. With the following query, we can also list this child latch.
select name, addr, latch#, child#, level#, gets, misses, sleeps, spin_gets
from v$latch_children where addr like '%18C5C8AD8';
NAME ADDR LATCH# CHILD# LEVEL# GETS MISSES SLEEPS SPIN_GETS
------------------- --------- ------ ------ ------ -------- ------ ------ ----------
enqueue hash chains 18C5C8AD8 36 6 4 74790437 81 0 81
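To list all the children of this latch (the count and statistics will differ per DB):
select child#, level#, gets, misses, sleeps
from v$latch_children
where name = 'enqueue hash chains'
order by child#;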
Note that 'global ctx hash table latch' is a single latch (without children),
but 'enqueue hash chains' has a few children (6 in the test DB).
In the above test, we manually killed the SID1 process. In real applications, this can happen whenever a session is abnormally terminated (for example, when memory is under pressure, the OS can choose a process to kill).
The above test also shows that when SID1 is killed, the unaware SID2 is abruptly disconnected, an unexpected impact on real application operations.
3. Non-Blocking Test
Instead of setting the global context value as in the above test, we will look at what happens if we only get the global context value.
In SID3, run the statements below:
alter system flush shared_pool;
alter system flush buffer_cache;
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ ---------- ----------
60046158 429 5 6721 10 7 3 1961502719
Again run the same Dtrace for SID1 (SPID 20024):
sudo dtrace -w -n \
'
pid$target::kslfre:entry /arg0 == 0x60046158/ {
printf("\n%s:%s (Addr=>0x%X, Mode=>%d, PID=>0x%X)", probefunc, probename, arg0, arg4, arg5);
@CNT[ustack(10, 0)] = count();
stop(); exit(0);
}
' -p 20024
In SID1, get the global context value:
SID1 (SPID 20024) >
declare
l_val varchar2(1000);
begin
l_val := sys_context('my_ctx', 'var1');
end;
/
Dtrace shows the call coming from the getter kzctxGblCtxGet (in the above Blocking Test, it was the setter kzctxgset):
dtrace: description '
pid$target::kslfre:entry ' matched 1 probe
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
2 81237 kslfre:entry
kslfre:entry (Addr=>0x60046158, Mode=>1, PID=>0x4B)
oracle`kslfre
oracle`kzctxGblCtxGet+0x841
oracle`kzctxAppCtxValGet3+0x481
oracle`psdsysctx+0x1eb
oracle`pessysctx2+0xe3
oracle`pevm_icd_call_common+0x12b
oracle`pfrinstr_BCAL+0x4e
oracle`pfrrun_no_tool+0x12a
oracle`pfrrun+0x4c0
oracle`plsql_run+0x288
Querying latch statistics in SID3 shows GETS increased by 1,
and the single latch 'global ctx hash table latch' held by SID1.
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ ---------- ----------
60046158 429 5 6722 10 7 3 1961502719
select * from v$latchholder;
PID SID LADDR NAME GETS
---- --- -------- --------------------------- ----
24 10 60046158 global ctx hash table latch 6722
Repeat the same SID2 code as above (now SPID 20026):
SID2 (SPID 20026) >
declare
l_lockname varchar2(1000);
l_lockid number;
begin
l_lockname := sys_context('my_ctx', 'my_lockname');
select lockid into l_lockid from sys.dbms_lock_allocated where name = l_lockname;
end;
/
PL/SQL procedure successfully completed.
SID2 is not blocked by SID1. The above Dtrace output shows that the SID1 latch get is "Mode=>1"
(probably shared mode). Querying latch statistics in SID3 shows GETS increased by 1, but MISSES and SLEEPS did not, even though the single latch 'global ctx hash table latch' is still held by SID1.
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ---- ------ ------ ---------- ----------
60046158 429 5 6723 10 7 3 1961502719
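For comparison with the Blocking Test, v$latch_misses can be re-checked with the same query as before; in this non-blocking scenario its sleep counters should stay unchanged:
select * from v$latch_misses
where parent_name = 'global ctx hash table latch'
and (sleep_count > 0 or wtr_slp_count > 0);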
4. Direct Subroutine Call Test
The blog Oracle 12 and latches, part 3 (Frits Hoogland Weblog) shows that the latch get subroutine "ksl_get_shared_latch" can be called directly. We will try it with the following test (in Oracle 19c on Linux).
In SID3, run the query below to get the 'global ctx hash table latch' address and stats:
select addr, latch#, level#, gets, misses, sleeps, spin_gets, wait_time
from v$latch where name = 'global ctx hash table latch';
ADDR LATCH# LEVEL# GETS MISSES SLEEPS SPIN_GETS WAIT_TIME
-------- ------ ------ ----- ------ ------ ---------- ----------
60077140 589 5 10879 10 2 8 57
In SID1, call "ksl_get_shared_latch" to acquire 'global ctx hash table latch' (following that blog, the arguments appear to be: latch address, willing-to-wait flag, "why", "where" and mode, where mode 16 means exclusive):
SID1 > oradebug setmypid
Statement processed.
SID1 > oradebug call ksl_get_shared_latch 0x60077140 1 0 4618 16
Function returned 1
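In SID3, v$latchholder should now show SID1 holding the latch (the same query as in the Blocking Test):
select * from v$latchholder;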
In SID2, set the global context
(note that in the previous two tests, we set the global context in SID1):
SID2 > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
SID2 is blocked by SID1 with wait event 'latch free'.
Now back in SID1, run a query (which requests the 'enqueue hash chains' latch). Immediately SID1 throws the same ORA-00600 [504].
SID1 > select lockid from sys.dbms_lock_allocated where name = 'MY_ULOCK';
select lockid from sys.dbms_lock_allocated where name = 'MY_ULOCK'
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [504], [0x0B41C6288], [32], [4],
[enqueue hash chains], [6], [0x000000000], [], [], [], [], []
and the incident file contains the details:
ORA-00600: internal error code, arguments: [504],
[0x0B41C6288], [32], [4], [enqueue hash chains], [6], [0x000000000], [], [], [], [], []
========= Dump for incident 45273 (ORA 600 [504]) ========
[TOC00003]
----- Beginning of Customized Incident Dump(s) -----
*** LATCH HIERARCHY ERROR ***
An attempt to get the 'enqueue hash chains' latch (child #6) at level 4 (address = 0xb41c6288)
from location 'ksq.h LINE:2721 ID:ksqgtl3' (also see call stack below)
conflicts with the following held latch(es):
Level 5: 'global ctx hash table latch' (address = 0x60077140)
gotten from location 'krvx.h LINE:18027 ID:krvxpaf: activate'
--------------------- Binary Stack Dump ---------------------
[1] (ksedst1()+95 -> kgdsdst())
[2] (ksedst()+58 -> ksedst1())
[3] (dbkedDefDump()+23080 -> ksedst())
[4] (ksedmp()+577 -> dbkedDefDump())
[5] (dbgexPhaseII()+2092 -> ksedmp())
[6] (dbgexExplicitEndInc()+285 -> dbgexPhaseII())
[7] (dbgeEndDDEInvocationImpl()+314 -> dbgexExplicitEndInc())
[8] (ksl_level_check()+2746 -> dbgeEndDDEInvocationImpl())
CALL TYPE: call ERROR SIGNALED: yes COMPONENT: VOS
[9] (kslgetl()+2800 -> ksl_level_check())
[10] (ksqgtlctx()+1873 -> kslgetl())
[11] (ksqgelctx()+838 -> ksqgtlctx())
[12] (kksfbc()+17684 -> ksqgelctx())
[13] (kkspsc0()+1566 -> kksfbc())
[14] (kksParseCursor()+114 -> kkspsc0())
[15] (opiosq0()+2310 -> kksParseCursor())
In SID1, we first held the Level 5 'global ctx hash table latch',
and subsequently requested the Level 4 'enqueue hash chains' latch (child #6)
in order to parse (kksParseCursor) our submitted SQL;
the latch ordering rule is violated, hence ORA-00600: [504] LATCH HIERARCHY ERROR. If we look at SID2 again, it now runs through.
SID2 > exec my_pkg.set_ctx('my_lockname', 'MY_ULOCK');
PL/SQL procedure successfully completed.
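As a final check in SID3, we can confirm that the latch was released during SID1's ORA-00600 error handling (otherwise SID2 could not have completed); the query below should return no row for 'global ctx hash table latch':
select * from v$latchholder;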