Wednesday, June 14, 2023

One Deadlock of 'row cache lock' and 'library cache lock'

In this Blog, we first demonstrate a deadlock of 'row cache lock' and 'library cache lock', and then provide an analysis.

Note: Tested in Oracle 19.17 on Linux.


1. Test Code


The test involves a cluster table and two Sql statements, one is a select, another is an alter.

drop table test_tab;

drop cluster test_cluster;

create cluster test_cluster (x number);

create index test_cluster_indx on cluster test_cluster;

create table test_tab (x, y) cluster test_cluster(x)
  as select level, 'abc' from dual connect by level <= 3;


create or replace procedure test_proc_sel_tab(p_cnt number) as
begin   
  for i in 1..p_cnt loop
    for c in (select * from test_tab) loop
      null;
    end loop;
    dbms_session.sleep(0.01);
  end loop;
end;
/

-- exec test_proc_sel(5);

create or replace procedure test_proc_alt_cluster(p_cnt number) as
begin
  for i in 1..p_cnt loop
    execute immediate 'ALTER CLUSTER test_cluster PCTFREE 18 INITRANS 19';
  end loop;
end;
/

-- exec test_proc_alt(5);


2. Test Run


Open two Sqlplus sessions, do following test. After a few seconds, Session-2 throws: ORA-00060: deadlock

-- Session-1@T1, run about 10 seconds

exec test_proc_sel_tab(1e4);

-- Session-2@T2, run within 10 seconds

exec test_proc_alt_cluster(1e4);

  -- output
  --   ORA-00060: deadlock detected while waiting for resource
  --   ORA-06512: at line 3
In order to make deadlock reproducible, both sessions execute statements concurrently in a loop.
If both statements are executed only occasionally at the same time, probably the deadlock appears only intermittently.


3. Trace Files


Here trace files after deadlock occurrence.


3.1 Session-1 Trace File



----- Current SQL Statement for this session (sql_id=g677askf3bry6) -----
SELECT * FROM TEST_TAB

----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x9a8b9220         4  procedure K.TEST_PROC_SEL_TAB
0x9a67a4b0         1  anonymous block

--------------------- Binary Stack Dump ---------------------

[1] (ksedst1()+95 -> kgdsdst())
...
[8] (kqrLockPo()+274 -> kqrget())
[9] (kqrpre()+1095 -> kqrLockPo())
[10] (kkdlgstcd()+347 -> kqrpre())
[11] (kkdlgstd()+3385 -> kkdlgstcd())
[12] (kkmfcbloCbk()+179 -> kkdlgstd())
[13] (kkmpfcbk()+235 -> kkmfcbloCbk())
[14] (qcsprfro()+648 -> kkmpfcbk())
[15] (qcsprfro_tree()+372 -> qcsprfro())
[16] (qcsprfro_tree()+138 -> qcsprfro_tree())
[17] (qcspafq()+240 -> qcsprfro_tree())
[18] (qcspqbDescendents()+264 -> qcspafq())
[19] (qcspqb()+255 -> qcspqbDescendents())
[20] (kkmdrv()+180 -> qcspqb())
[21] (opiSem()+2191 -> kkmdrv())
[22] (opiprs()+321 -> opiSem())
[23] (kksParseChildCursor()+527 -> opiprs())
[24] (rpiswu2()+2004 -> kksParseChildCursor())
[25] (kksLoadChild()+5287 -> rpiswu2())
[26] (kxsGetRuntimeLock()+2335 -> kksLoadChild())
[27] (kksfbc()+19638 -> kxsGetRuntimeLock())
[28] (opiexe()+2864 -> kksfbc())

-- kqrLockPo.rsi=r15 = 000000009AE73170 = v$rowcache_parent.ADDRESS of TEST_CLUSTER	

========== FRAME [9] (kqrpre()+1095 -> kqrLockPo()) ========== 
defined by frame pointers 0x7ffdf28dce80  and 0x7ffdf28dcd30
CALL TYPE: call   ERROR SIGNALED: no   COMPONENT: (null)
RDI 000000000000000B RSI 000000009AE73170 RDX 00000000B76FF440 
RCX 0000000000000003 R8 0000000000000000 R9 00007F4C00000000 
RAX 0000000000000000 RBX 0000000000000000 RBP 00007FFDF28DCE80 
R10 00007FFDF28DB010 R11 0000000000000000 R12 0000000000000003 
R13 00007F4CDA143140 R14 00000000A789CB80 R15 000000009AE73170 
RSP 00007FFDF28DCD40 RIP 0000000012DCE457 

          SOC: 0x9c32fb98, type: row cache enqueues (111), map: 0xa5f39680
               state: LIVE (0x99fc), flags: INIT (0x1)
          row cache enqueue: count=1 session=0xb71de728 object=0x9ae73170, request=S
          flag=03 WAI/TRC/-/-/-/-/-/- savepoint=0xd46f
          row cache parent object: addr=0x9ae73170 cid=8(dc_objects)
          conid=0 conuid=0 inc=1, pdbinc=1
          hash=d5c4d330 typ=1 transaction=0xac528090 flags=0000002a
          version=13 mtx version=134
          objectno=4774244 ownerid=49 nsp=5
          name=TEST_CLUSTER
          own=0x9ae73240[0x9d422ca0,0x9d422ca0] wat=0x9ae73250[0x9c32fc18,0x9c32fc18] mode=X req=N
                   
    Current Wait Stack:
     0: waiting for 'row cache lock'
        cache id=0x8, mode=0x0, request=0x3
        wait_id=6425 seq_num=6454 snap_id=1
        wait times: snap=1.030535 sec, exc=1.030535 sec, total=1.030535 sec
        wait times: max=infinite, heur=1.030535 sec
        wait counts: calls=1 os=1
        in_wait=1 iflags=0x55a2

    Dumping one waiter:
      inst: 1, sid: 369, ser: 13483
      wait event: 'library cache lock'
        p1: 'handle address'=0x9aebb010
        p2: 'lock address'=0x9d8cc410
        p3: '100*mode+namespace'=0x48d96600010003
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 0 secs, waiter_cache_ver: 117


3.2 Session-2 Trace file



----- Current SQL Statement for this session (sql_id=42y5q6y5vyxsw) -----
ALTER CLUSTER test_cluster PCTFREE 18 INITRANS 19

----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x9a67e1c0         4  procedure K.TEST_PROC_ALT_CLUSTER
0x9a6665d8         1  anonymous block

--------------------- Binary Stack Dump ---------------------

[1] (ksedst1()+95 -> kgdsdst()) 
...
[7] (kglLockWait()+1240 -> ksfgti()) 
[8] (kgllkal()+4711 -> kglLockWait()) 
[9] (kglLock()+1425 -> kgllkal()) 
[10] (kglget()+293 -> kglLock()) 
[11] (kkdllk0()+418 -> kglget()) 
[12] (kkdllcl()+557 -> kkdllk0()) 
[13] (kkdllct()+26 -> kkdllcl()) 
[14] (acldrv()+1359 -> kkdllct()) 
[15] (opiexe()+30659 -> acldrv()) 

-- kgllkal.rdx = 000000009AEBB010 = v$db_object_cache.ADDR for TEST_TAB

========== FRAME [9] (kglLock()+1425 -> kgllkal()) ========== 
defined by frame pointers 0x7ffd39edb340  and 0x7ffd39edade0
CALL TYPE: call   ERROR SIGNALED: no   COMPONENT: LIBCACHE
RDI 00007F087E5419C0 RSI 00007FFD39EDB2E0 RDX 000000009AEBB010 
RCX 0000000000000003 R8 000000009AEBB090 R9 00007FFD39EDBAA0 
RAX 00000000A541C1F8 RBX 0000000000000001 RBP 00007FFD39EDB340 
R10 00007FFD39ED8FF0 R11 00000000132262E0 R12 0000000000000000 
R13 0000000000000000 R14 0000000000000000 R15 0000000000000006 
RSP 00007FFD39EDADF0 RIP 00000000130AC291 

    Current Wait Stack:
     0: waiting for 'library cache lock'
        handle address=0x9aebb010, lock address=0x9d8cc410, 100*mode+namespace=0x48d96600010003
        wait_id=5715 seq_num=5755 snap_id=1
        wait times: snap=1.002169 sec, exc=1.002169 sec, total=1.002169 sec
        wait times: max=15 min 0 sec, heur=1.002169 sec
        wait counts: calls=1 os=1
        in_wait=1 iflags=0x55a2
        
    Dumping one waiter:
      inst: 1, sid: 549, ser: 40906
      wait event: 'row cache lock'
        p1: 'cache id'=0x8
        p2: 'mode'=0x0
        p3: 'request'=0x3
      row_wait_obj#: 105930, block#: 271548, row#: 0, file# 47
      min_blocked_time: 0 secs, waiter_cache_ver: 117


3.3 diag_base trace file



DEADLOCK DETECTED (id=0x770be968)
Chain Signature: 'library cache lock'<='row cache lock' (cycle)
 
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
              process id: 57, oracle@test_db
              session id: 549
    }
    is waiting for 'row cache lock' with wait info:
    {
                      p1: 'cache id'=0x8                 -- P1=8  dc_objects (PARENT) or dc_object_grants
                      p2: 'mode'=0x0
                      p3: 'request'=0x3                  -- P3=3 (share mode), P3=5 (exclusive)
             current sql: SELECT * FROM TEST_TAB
    }
    and is blocked by
 => Oracle session identified by:
    {
              process id: 50, oracle@test_db
              session id: 369
    }
    which is waiting for 'library cache lock' with wait info:
    {
                      p1: 'handle address'=0x9aebb010     -- P1=0x9aebb010 TEST_TAB
                      p2: 'lock address'=0x9d8cc410
                      p3: '100*mode+namespace'=0x48d96600010003
                    Name: K.TEST_TAB   [Nsp: 1 Typ: 2 Id: 4774246]
           timeout after: 14 min 59 sec                   -- 'library cache lock' maximum 900 seconds
             current sql: ALTER CLUSTER test_cluster PCTFREE 18 INITRANS 19
    }
    and is blocked by the session at the start of the chain. 


4. GDB Debug



4.1 Meta Info


First, we collect some Meta Info of our test objects.

select name, type, addr, hash_value, v.* from v$db_object_cache v 
 where name in ('TEST_CLUSTER', 'TEST_CLUSTER_INDX', 'TEST_TAB');

  NAME               TYPE     ADDR      HASH_VALUE
  ------------------ -------- --------- ----------
  TEST_CLUSTER_INDX  INDEX    9A9F23B8   531250600
  TEST_CLUSTER       CLUSTER  9AE66A10  2742975119
  TEST_TAB           TABLE    9AEBB010  1905210899


-- Blog: Oracle ROWCACHE Views and Contents (http://ksun-oracle.blogspot.com/2018/10/oracle-rowcache-views.html)

----helper function----
create or replace function dump_hex2str (dump_hex varchar2) return varchar2 is
  l_str varchar2(100);
begin
  with sq_pos as (select level pos from dual connect by level <= 1000)
      ,sq_chr as (select pos, chr(to_number(substr(dump_hex, (pos-1)*2+1, 2), 'XX')) ch
                  from sq_pos where pos <= length(dump_hex)/2)
  select listagg(ch, '') within group (order by pos) word
    into l_str
  from sq_chr;
  return l_str;
end;
/

select * from (
  select to_number(ltrim((substr(key, 7, 2)||substr(key, 5, 2)||substr(key, 3, 2)||substr(key, 1, 2)), '0'), 'XXXX') Schema_User_ID,
         (select username from dba_users where user_id = 
            to_number(ltrim((substr(key, 7, 2)||substr(key, 5, 2)||substr(key, 3, 2)||substr(key, 1, 2)), '0'), 'XXXX')) username,
         dump_hex2str(rtrim(substr(key, 13), '0')) dc_object_name, 
         to_number(trim(both '0' from substr(key, 11, 2)||substr(key, 9, 2)), 'XXXX') key_str_len,
         v.* 
  from v$rowcache_parent v 
  where cache_name in ('dc_objects') 
) where dc_object_name = 'TEST_CLUSTER';

  DC_OBJECT_NAME  ADDRESS
  --------------  --------
  TEST_CLUSTER    9AE73170


4.2 GDB script


Based on above meta info, we compose a gdb script:

------------------------- gdb -x gdb_rowcache_libcache_deadlock -p 3214 -------------------------
# first run test_proc_alt_cluster(1) to make test reproducible.
# For kglnaobj_and_schema, Tracing Library Cache Locks 3: Object Names 
#     (https://nenadnoveljic.com/blog/library-cache-lock-object-name/)

# kgllkal.rdx = 9AEBB010 = v$db_object_cache.ADDR for TEST_TAB
# kqrLockPo.rsi=r15 = 9AE73170 = v$rowcache_parent.ADDRESS of TEST_CLUSTER	

set pagination off
set logging file /tmp/gdb.log
set logging overwrite on
set logging on
set $KGLLKHDL_ADDR=0X9AEBB010
set $ROWCACHE_PO_ADDR=0X9AE73170
set $s = 0
set $k = 0
set $r = 0

break *kgllkal if $rdx==$KGLLKHDL_ADDR
command
printf "S-%i ===== LibCache Seq: %i -- kgllkal (kgllkhdl Addr(rdx)=>%X, Mode(rcx)=>%X), kglnaobj_and_schema: %s --\n", ++$s, ++$k, $rdx, $rcx, ($rdx+0x1c0)
bt 3
cont
end

# kgllkdl or kglUnLock. ALTER CLUSTER GDB calls kgllkdl.
break *kgllkdl if $r8==$KGLLKHDL_ADDR
command
printf "S-%i ===== LibCache Seq: %i -- kgllkdl (kgllkhdl Addr(r8)=>%X) --\n", ++$s, ++$k, $r8
bt 3
cont
end

# kgllkdl or kglUnLock. QUERY GDB calls kglUnLock.
break *kglUnLock if $r8==$KGLLKHDL_ADDR || $r15==$KGLLKHDL_ADDR
command
printf "S-%i ===== LibCache Seq: %i -- kglUnLock (kgllkhdl Addr(r8)=>%X or Addr(r15)=>%X) --\n", ++$s, ++$k, $r8, $r15
bt 3
cont
end

break *kqrLockPo if $rsi==$ROWCACHE_PO_ADDR || $r15==$ROWCACHE_PO_ADDR
command
printf "S-%i ===== RowCache Seq: %i -- kqrLockPo (Addr(rsi)=>%X, Request Mode(rcx)=>%X) --\n", ++$s, ++$r, $rsi, $rcx
bt 3
cont
end

# kqrReleaseLock or kqreqd
break *kqrReleaseLock if $rdx==$ROWCACHE_PO_ADDR || $rdi==$ROWCACHE_PO_ADDR
command
printf "S-%i ===== RowCache Seq: %i -- kqrReleaseLock (Addr(rdx)=>%X) --\n", ++$s, ++$r, $rdx
bt 3
cont
end

cont


4.3 ALTER CLUSTER GDB


We run ALTER CLUSTER with gdb trace, here the output:

06:55:14 SQL >  exec test_proc_alt_cluster(1);

Breakpoint 4, 0x0000000012dd0890 in kqrLockPo ()
S-1 ===== RowCache Seq: 1 -- kqrLockPo (Addr(rsi)=>9AE73170, Request Mode(rcx)=>5) --
#0  0x0000000012dd0890 in kqrLockPo ()
#1  0x0000000012dce457 in kqrpre ()
#2  0x00000000030b573d in kkdlGetRowCacheOb ()

Breakpoint 1, 0x00000000130b0b40 in kgllkal ()
S-2 ===== LibCache Seq: 1 -- kgllkal (kgllkhdl Addr(rdx)=>9AEBB010, Mode(rcx)=>3), kglnaobj_and_schema: TEST_TABK --
#0  0x00000000130b0b40 in kgllkal ()
#1  0x00000000130ac291 in kglLock ()
#2  0x00000000130a70c5 in kglget ()

Breakpoint 1, 0x00000000130b0b40 in kgllkal ()
S-3 ===== LibCache Seq: 2 -- kgllkal (kgllkhdl Addr(rdx)=>9AEBB010, Mode(rcx)=>1), kglnaobj_and_schema: TEST_TABK --
#0  0x00000000130b0b40 in kgllkal ()
#1  0x00000000130ac291 in kglLock ()
#2  0x00000000130a70c5 in kglget ()

Breakpoint 2, 0x00000000130b34d0 in kgllkdl ()
S-4 ===== LibCache Seq: 3 -- kgllkdl (kgllkhdl Addr(r8)=>9AEBB010) --
#0  0x00000000130b34d0 in kgllkdl ()
#1  0x00000000130ae53c in kglUnLock ()
#2  0x0000000012bbaad1 in ktagetg0 ()

Breakpoint 5, 0x0000000012dd5b40 in kqrReleaseLock ()
S-5 ===== RowCache Seq: 2 -- kqrReleaseLock (Addr(rdx)=>9AE73170) --
#0  0x0000000012dd5b40 in kqrReleaseLock ()
#1  0x0000000012dd4e9a in kqreqd ()
#2  0x00000000128efaaa in kss_del_cb ()


4.4 Query GDB


We run Query with gdb trace, here the output:

06:57:32 SQL > exec test_proc_sel_tab(1);

Breakpoint 1, 0x00000000130b0b40 in kgllkal ()
S-1 ===== LibCache Seq: 1 -- kgllkal (kgllkhdl Addr(rdx)=>9AEBB010, Mode(rcx)=>2), kglnaobj_and_schema: TEST_TABK --
#0  0x00000000130b0b40 in kgllkal ()
#1  0x00000000130ac291 in kglLock ()
#2  0x00000000130a70c5 in kglget ()

Breakpoint 4, 0x0000000012dd0890 in kqrLockPo ()
S-2 ===== RowCache Seq: 1 -- kqrLockPo (Addr(rsi)=>9AE73170, Request Mode(rcx)=>3) --
#0  0x0000000012dd0890 in kqrLockPo ()
#1  0x0000000012dce457 in kqrpre ()
#2  0x00000000030bf8ab in kkdlgstcd ()

Breakpoint 5, 0x0000000012dd5b40 in kqrReleaseLock ()
S-3 ===== RowCache Seq: 2 -- kqrReleaseLock (Addr(rdx)=>9AE73170) --
#0  0x0000000012dd5b40 in kqrReleaseLock ()
#1  0x0000000012dd4e9a in kqreqd ()
#2  0x0000000012dd4833 in kqrprl ()

Breakpoint 4, 0x0000000012dd0890 in kqrLockPo ()
S-4 ===== RowCache Seq: 3 -- kqrLockPo (Addr(rsi)=>9AE73170, Request Mode(rcx)=>3) --
#0  0x0000000012dd0890 in kqrLockPo ()
#1  0x0000000012dd7b22 in kqrpre1 ()
#2  0x00000000032d0fdf in kqlobjlod ()

Breakpoint 5, 0x0000000012dd5b40 in kqrReleaseLock ()
S-5 ===== RowCache Seq: 4 -- kqrReleaseLock (Addr(rdx)=>9AE73170) --
#0  0x0000000012dd5b40 in kqrReleaseLock ()
#1  0x0000000012dd4e9a in kqreqd ()
#2  0x0000000012dd4833 in kqrprl ()

Breakpoint 3, 0x00000000130ae460 in kglUnLock ()
S-6 ===== LibCache Seq: 2 -- kglUnLock (kgllkhdl Addr(r8)=>9AEBB010) --
#0  0x00000000130ae460 in kglUnLock ()
#1  0x00000000030aaf05 in kkdcicf ()
#2  0x0000000003094ab6 in kkmpost ()

Breakpoint 3, 0x00000000130ae460 in kglUnLock ()
S-7 ===== LibCache Seq: 3 -- kglUnLock (kgllkhdl Addr(r8)=>9AEBB010) --
#0  0x00000000130ae460 in kglUnLock ()
#1  0x00000000030aaf05 in kkdcicf ()
#2  0x0000000003094ab6 in kkmpost ()


5. Root Cacuse Analysis


Look gdb lock request order of both sessions:

(a). ALTER CLUSTER GDB shows: 

       S-1 ===== kqrLockPo ()         'row cache lock'     reqeust on TEST_CLUSTER 
                                      
       S-2 ===== kgllkal ()           'library cache lock' allocate on TEST_TAB
                                      
       S-3 ===== kgllkal ()           'library cache lock' delete on TEST_TAB
                                      
       S-4 ===== kgllkdl ()           'library cache lock' delete on TEST_TAB
                                      
       S-5 ===== kqrReleaseLock ()    'row cache lock'     release on TEST_CLUSTER 

(b). Query GDB shows
        
       S-1 ===== kgllkal ()           'library cache lock' allocate on TEST_TAB
       
       S-2 ===== kqrLockPo ()         'row cache lock'     reqeust on TEST_CLUSTER
       
       S-3 ===== kqrReleaseLock ()    'row cache lock'     release on TEST_CLUSTER 
       
       S-4 ===== kqrLockPo ()         'row cache lock'     reqeust on TEST_CLUSTER
       
       S-5 ===== kqrReleaseLock ()    'row cache lock'     release on TEST_CLUSTER 
       
       S-6 ===== kglUnLock ()         'library cache lock' unlock on TEST_TAB
       
       S-7 ===== kglUnLock ()         'library cache lock' unlock on TEST_TAB
ALTER CLUSTER GDB shows that 'library cache lock' (from allocate till delete) is completely held within 'row cache lock' holding.
(hence both locks can be held at the same time, and 'row cache lock' can be held longer time)

Query GDB shows that 'row cache lock' (from request till release) is completely held within 'library cache lock' holding.
(hence both locks can be held at the same time, and 'library cache lock' can be held longer time)

S1 and S2 in ALTER CLUSTER GDB is reverse order of S1 and S2 in Query GDB.

Therefore the deadlock is caused by the crossing order of 'row cache lock' and 'library cache lock' reqeusts.

There are also similar cases of deadlocks documented in:

  Oracle MOS: Deadlock on Row Cache Lock and Library Cache Lock Dropping REF-PARTITION (Doc ID 2796372.1)

  Deadlock Caused by Row Cache and Library Cache Locks (https://nenadnoveljic.com/blog/deadlock-row-cache-lock-library-cache-lock/)