Thursday, February 24, 2022

Oracle Kill Session Crashes DB Instance

This blog demonstrates DB Instance crash caused by execution of statement:
    ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;
and alert.log error message:
    ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ]
    ORA-00601: cleanup lock conflict
    PMON: terminating the instance due to ORA error 12752
    Cause - 'Instance is being terminated due to fatal process death (CL01)'
Note: Tested on Oracle 19.13 in Linux.

Update (2023-02-26): By this Blog, Oracle fixed the described and reproduced problem with Bug 33060212 (from 19.18 in January 2023):
  Bug 33060212 - ORA-07445 [kgldmp0] and Instance Termination Due to Background Process CLMN (Doc ID 33060212.8)
    The fix for 33060212 is first included in 19.18.0.0.230117 (January 2023) DB Release Update (DB RU)


1. Test Setup


Open 3 Sqlplus sessions (SID-1, SID-2, SID-3). SID-1 and SID-2 are two test sessions. SID-3 is a monitor session.
In the test, SID-1 is "blocked session", and SID-2 is "blocker session".

At first, in each session, we run a query to get its session info (SID, SERIAL#, PID, SPID).

In SID-1, we also run two queries on dbms_standard to warm up (avoid hard parsing), which will be used in later test.

---================== SID-1 (blocked session, SID=100, SERIAL#=1001, SPID=10011) ==================---
-- grant execute on sys.dbms_support to k;
 
15:25:41 SID-1 > select s.program, s.sid, s.serial#, pid, spid, last_call_et
                 from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;
 
  PROGRAM       SID SERIAL#  PID SPID     LAST_CALL_ET
  ------------ ---- ------- ---- -------- ------------
  sqlplus.exe   100    1001   51 10011               0
 
15:25:46 SID-1 > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
 
  LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
  -------------------- --------------------------
  15:25:47                                    128
 
15:25:48 SID-1 > select dbms_standard.database_name from dual;
 
  DATABASE_NAME
  ----------------------------------------
  TESTDB
 
---================== SID-2 (blocker session, SID=200, SERIAL#=2002, SPID=20022) ==================---
 
15:26:01 SID-2 > select s.program, s.sid, s.serial#, pid, spid, last_call_et
                 from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;
 
  PROGRAM       SID SERIAL#  PID SPID     LAST_CALL_ET
  ------------ ---- ------- ---- -------- ------------
  sqlplus.exe   200    2002   52 20022               0
 
---================== SID-3 (monitor session, SID=300, SERIAL#=3003, SPID=30033) ==================---
 
15:26:08 SID-3 > select s.program, s.sid, s.serial#, pid, spid, last_call_et
                 from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;
 
  PROGRAM       SID SERIAL#  PID SPID     LAST_CALL_ET
  ------------ ---- ------- ---- -------- ------------
  sqlplus.exe   300    3003   53 30033               0


2. Test Run


We start a gdb on SID-2 process, set breakpoint on "kgxRelease". In each step, we print out "step_count", and display actual time by "shell date".

Then in Sqlplus SID-2, we run:

   SID-2 > select dbms_standard.database_name from dual;
It is waiting for gdb instruction.

Now we run gdb continue command. It stops on "kgxRelease", which indicates that SID-2 is holding a mutex (by kglGetMutex -> kgxExclusive), but not yet released by "kgxRelease".

We go to Sqlplus SID-1, run:

   SID-1 (Run-1) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
It returns and not blocked.

Repeat the same actions ("gdb kgxRelease" and "SID-1 run") till SID-1 is hanging.

In our test, it occurs in Step-8 (see following test action flow diagram). At this moment, it signifies that SID-1 is blocked by a mutex held by SID-2. SID-1 is waiting on event "library cache: mutex X" (see later Section 3).

Then, from SID-3, we run "kill session immediate" on SID-1:

   15:32:41 SID-3 > alter system kill session '100,1001' immediate;
 
   (Note* instead of killing "blocker session" SID-2, we kill "blocked session" SID-1)
Now SID-1 shows "ORA-03114: not connected to ORACLE".

Immediately from v$session, we saw both SID-1 and one CLMN/CL01 are waiting on event "library cache: mutex X" (see later Section 3). In order to to cleanup killed processes (Marked process newly dead), CL01 from PMON group (PMON, CLMN, CLnn) is called, but also blocked (see later Section 3).

After about 5 minutes, DB instance crashed.

========= gdb Session-2 process ========= | ================================ SQL Session-1 ===================================
oracle@testdb > gdb -p 20022              |
                                          |
(gdb) break kgxRelease                    |
Breakpoint 1 at 0x12f90180                |
(gdb) set $step_count = 0                 |
(gdb) display $step_count++               |
1: $step_count++ = 0                      |
(gdb) shell date                          |
15:28:02                                  |
                                          |
-------------------------- 15:28:30 SID-2 > select dbms_standard.database_name from dual; ---------------------------------
                                          |                                         
------------------------------------- Step-1 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 1                      |
(gdb) shell date                          |
15:28:58                                  |
                                       --->>>
                                          | SID-1 (Run-1) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:29:01                                    128
------------------------------------- Step-2 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 2                      |
(gdb) shell date                          |
15:29:17                                  |
                                       --->>>
                                          | SID-1 (Run-2) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:29:27                                    128
------------------------------------- Step-3 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 3                      |
(gdb) shell date                          |
15:29:32                                  |
                                       --->>>
                                          | SID-1 (Run-3) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:29:37                                    128
------------------------------------- Step-4 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 4                      |
(gdb) shell date                          |
15:29:45                                  |
                                       --->>>
                                          | SID-1 (Run-4) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:29:50                                    128
------------------------------------- Step-5 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 5                      |
(gdb) shell date                          |
15:29:57                                  |
                                       --->>>
                                          | SID-1 (Run-5) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:30:06                                    128
------------------------------------- Step-6 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 6                      |
(gdb) shell date                          |
15:30:11                                  |
                                       --->>>
                                          | SID-1 (Run-6) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:30:38                                    128
------------------------------------- Step-7 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 7                      |
(gdb) shell date                          |
15:30:46                                  |
                                       --->>>
                                          | SID-1 (Run-7) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          |
                                          |   LOCALTIMESTAMP       ORA_MAX_NAME_LEN_SUPPORTED
                                          |   -------------------- --------------------------
                                          |   15:30:50                                    128
------------------------------------- Step-8 --------------------------------------------------------------------------------
(gdb) c                                   |
Continuing.                               |
Breakpoint 1, 0x12f90180 in kgxRelease () |
1: $step_count++ = 8                      |
(gdb) shell date                          |
15:31:05                                  |
                                       --->>>
                                          | SID-1 (Run-8) > select localtimestamp, dbms_standard.ora_max_name_len_supported from dual;
                                          | select localtimestamp, dbms_standard.ora_max_name_len_supported from dual
                                          |
                                          | *** SID-1 is blocked ***
                                          |                                                                                     
-------------------------- 15:32:41 SID-3 > alter system kill session '100,1001' immediate; ---------------------------------
                                          |
                                          | ERROR at line 1:
                                          | ORA-03113: end-of-file on communication channel
                                          | Process ID: 10011
                                          | Session ID: 100 Serial number: 1001
                                          | ERROR:
                                          | ORA-03114: not connected to ORACLE
                                          |
-----------------------------------------------------------------------------------------------------------------------------
From gdb, we can also show call stack and register info of of SID-2.

(gdb) display $r8d                       
2: $r8d = 4                              
 
(gdb) bt
#0  0x12f90180 in kgxRelease ()
#1  0x12f6a588 in kglReleaseMutex ()
#2  0x12f635b7 in kglLoadOnLock ()
#3  0x12f62255 in kgllkal ()
#4  0x12f5d90e in kglLock ()
#5  0x12f587d5 in kglget ()
#6  0x034d1e68 in kksaxs ()
#7  0x034d016d in kksauc ()
#8  0x034e3db9 in kkscscid_auc_eval ()
#9  0x12c3d85a in kkscsCheckCriteria ()
#10 0x034e1d17 in kkscsCheckCursor ()
#11 0x034e0eec in kkscsSearchChildList ()
#12 0x12c2e5c6 in kksfbc ()
#13 0x12c27e37 in kkspsc0 ()
#14 0x12c27552 in kksParseCursor ()
#15 0x12e139c6 in opiosq0 ()
#16 0x12b7d71f in kpooprx ()
#17 0x12b7b115 in kpoal8 ()
#18 0x12b08902 in opiodr ()
#19 0x12ed2fbe in ttcpip ()
#20 0x028afa8c in opitsk ()
#21 0x028b43e8 in opiino ()
#22 0x12b08902 in opiodr ()
#23 0x028ab806 in opidrv ()
#24 0x03425585 in sou2o ()
#25 0x00dcb266 in opimai_real ()
#26 0x03430fb1 in ssthrdmain ()
#27 0x00dcb090 in main ()
 
(gdb) info r
rax            0x0      0
rbx            0x7fecd71ec9c0   140655198128576
rcx            0x7fecd70f4738   140655197112120
rdx            0x0      0
rsi            0x9b71f4b0       2607936688
rdi            0x7fecd71ec9c0   140655198128576
rbp            0x7ffebb73bbe0   0x7ffebb73bbe0
rsp            0x7ffebb73bb78   0x7ffebb73bb78
r8             0x4      4
r9             0x80002801       2147493889
r10            0x0      0
r11            0x2      2
r12            0x9b71f4b0       2607936688
r13            0xa27c0820       2726037536
r14            0x2      2
r15            0x7ffebb73c9a0   140732043348384
rip            0x12f90180       0x12f90180 
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
Once SID-1 is blocked in Step-8, if we make an errorstack dump of SID-1 before SID-1 kill session, the Call Stack shows that SID-1 is blocked in "kglGetMutex -> kgxExclusive -> kgxWait":

oradebug dump errorstack 3
 
  ----- Current SQL Statement for this session (sql_id=27t9knmt7u5pq) -----
  select localtimestamp, dbms_standard.ora_max_name_len_supported from dual
 
  ----- Call Stack Trace -----
    FRAME [1] (ksedst1()+95 -> kgdsdst())
    FRAME [2] (ksedst()+58 -> ksedst1())
    FRAME [3] (dbkedDefDump()+23448 -> ksedst())
    FRAME [4] (ksedmp()+577 -> dbkedDefDump())
    FRAME [5] (ksdxdmp()+1425 -> ksedmp())
    FRAME [6] (ksdxfdmp()+152 -> ksdxdmp())
    FRAME [7] (ksdxcb()+872 -> ksdxfdmp())
    FRAME [8] (sspuser()+200 -> ksdxcb())
    FRAME [9] (__sighandler() -> sspuser())
    FRAME [10] (semtimedop()+10 -> __sighandler())
    FRAME [11] (sskgpwwait()+245 -> semtimedop())
    FRAME [12] (skgpwwait()+187 -> sskgpwwait())
    FRAME [13] (ksliwat()+2293 -> skgpwwait())
    FRAME [14] (kslwaitctx()+200 -> ksliwat())
    FRAME [15] (kgxWait()+1291 -> kslwaitctx())
    FRAME [16] (kgxExclusive()+712 -> kgxWait())
    FRAME [17] (kglGetMutex()+147 -> kgxExclusive())
    FRAME [18] (kglpin()+696 -> kglGetMutex())
    FRAME [19] (kglpnp()+439 -> kglpin())
    FRAME [20] (kgiina()+352 -> kglpnp())
    FRAME [21] (kgiinp()+39 -> kgiina())
    FRAME [22] (peiinspbn()+146 -> kgiinp())
    FRAME [23] (kkxpispbn()+212 -> peiinspbn())
    FRAME [24] (kgmexec()+688 -> kkxpispbn())
    FRAME [25] (evapls()+1270 -> kgmexec())
    FRAME [26] (evaopn2()+747 -> evapls())
    FRAME [27] (kpofcr()+7390 -> evaopn2())
    FRAME [28] (qerfiFetch()+143 -> kpofcr())
    FRAME [29] (opifch2()+3211 -> qerfiFetch())
    FRAME [30] (kpoal8()+4185 -> opifch2())
    FRAME [31] (opiodr()+1202 -> kpoal8())
    FRAME [32] (ttcpip()+1246 -> opiodr())


3. Session Event Monitoring During Test


From SID-3 (monitor session), we can see the Session Event change in three steps:
    (1). No Blocking
    (2). User Session Blocked on "library cache: mutex X" when mutex not released (kgxRelease) before Kill Session
    (3). Both User Session and CL01 Blocked on "library cache: mutex X" when mutex not released (kgxRelease) after Kill Session

----------================== SID-3 (monitor session, SID=300, SERIAL#=3003, SPID=30033) ===========================================================
-----****** Before blocking
 
15:30:30 SID-3 > select s.program, event, s.sid, s.serial#, pid, spid, p1raw, p2raw, p3raw, last_call_et, p1text, p2text, p3text
                 from v$session s, v$process p
                 where s.paddr=p.addr and (lower(s.program) like '%sql%' or lower(s.program) like '%pmon%'
                                        or lower(s.program) like '%(cl%' or lower(s.program) like '%(dia%')
                 order by s.program desc;
 
PROGRAM               EVENT                         SID SERIAL#  PID SPID  P1RAW    P2RAW            P3RAW            LAST_CALL_ET P1TEXT     P2TEXT  P3TEXT
--------------------- ---------------------------- ---- ------- ---- ----- -------- ---------------- ---------------- ------------ ---------  ------- --------------------
sqlplus.exe           SQL*Net message from client   100    1001   51 10011 54435000 0000000000000001 00                         24 driver id  #bytes
sqlplus.exe           SQL*Net message from client   200    2002   52 20022 54435000 0000000000000001 00                         61 driver id  #bytes
sqlplus.exe           SQL*Net message to client     300    3003   53 30033 54435000 0000000000000001 00                          0 driver id  #bytes
oracle@testdb (PMON)  pmon timer                    359   33127    2 31362 0000012C 00               00                        562 duration  
oracle@testdb (DIAG)  DIAG idle wait                897   49978   11 31394 00000003 0000000000000001 00                        560 component  where   wait time(millisec)
oracle@testdb (DIA0)  DIAG idle wait                  3   31438   18 31411 00000003 0000000000000001 00                        560 component  where   wait time(millisec)
oracle@testdb (CLMN)  pmon timer                    539   36539    3 31366 0000012C 00               00                        562 duration  
oracle@testdb (CL01)  pmon timer                    186   46807   49 31939 0000012C 00               00                        465 duration  
oracle@testdb (CL00)  pmon timer                    722   15363   34 31937 0000012C 00               00                        465 duration  
 
 
-----****** During "library cache: mutex X" blocking of User Session SID-1 (SID=100) before Kill Session
 
15:31:26 SID-3 > select s.program, event, s.sid, s.serial#, pid, spid, p1raw, p2raw, p3raw, last_call_et, p1text, p2text, p3text
                 from v$session s, v$process p
                 where s.paddr=p.addr and (lower(s.program) like '%sql%' or lower(s.program) like '%pmon%'
                                        or lower(s.program) like '%(cl%' or lower(s.program) like '%(dia%')
                 order by s.program desc;
 
PROGRAM               EVENT                         SID SERIAL#  PID SPID  P1RAW    P2RAW            P3RAW            LAST_CALL_ET P1TEXT    P2TEXT  P3TEXT
--------------------- ---------------------------- ---- ------- ---- ----- -------- ---------------- ---------------- ------------ --------- ------  --------------------
sqlplus.exe           library cache: mutex X        100    1001   51 10011 7CE2FAC4 00000C8000000000 000004C500010004           18 idn       value   where
sqlplus.exe           SQL*Net message from client   200    2002   52 20022 54435000 0000000000000001 00                        117 driver id #bytes
sqlplus.exe           SQL*Net message to client     300    3003   53 30033 54435000 0000000000000001 00                          0 driver id #bytes
oracle@testdb (PMON)  pmon timer                    359   33127    2 31362 0000012C 00               00                        618 duration        
oracle@testdb (DIAG)  DIAG idle wait                897   49978   11 31394 00000003 0000000000000001 00                        616 component where   wait time(millisec)
oracle@testdb (DIA0)  DIAG idle wait                  3   31438   18 31411 00000003 0000000000000001 00                        616 component where   wait time(millisec)
oracle@testdb (CLMN)  pmon timer                    539   36539    3 31366 0000012C 00               00                        618 duration        
oracle@testdb (CL01)  pmon timer                    186   46807   49 31939 0000012C 00               00                        521 duration        
oracle@testdb (CL00)  pmon timer                    722   15363   34 31937 0000012C 00               00                        521 duration        
 
 
-----****** "kill session immediate" on SID-1
 
15:32:41 SID-3 > alter system kill session '100,1001' immediate;
 
   System altered.
 
 
-----****** During "library cache: mutex X" blocking of both User Session SID-1 (SID=100), and CLMN/CLnn (CL01 SID=186) after Kill Session
 
15:33:05 SID-3 > select s.program, event, s.sid, s.serial#, pid, spid, p1raw, p2raw, p3raw, last_call_et, p1text, p2text, p3text
                 from v$session s, v$process p
                 where s.paddr=p.addr and (lower(s.program) like '%sql%' or lower(s.program) like '%pmon%'
                                        or lower(s.program) like '%(cl%' or lower(s.program) like '%(dia%')
                 order by s.program desc;
 
PROGRAM               EVENT                         SID SERIAL#  PID SPID  P1RAW    P2RAW            P3RAW            LAST_CALL_ET P1TEXT     P2TEXT  P3TEXT
--------------------- ---------------------------- ---- ------- ---- ----- -------- ---------------- ---------------- ------------ ---------- ------- --------------------
sqlplus.exe           library cache: mutex X        100    1001   51 10011 7CE2FAC4 00000C8000000000 000004C500010004          117 idn        value   where
sqlplus.exe           SQL*Net message from client   200    2002   52 20022 54435000 0000000000000001 00                        216 driver id  #bytes
sqlplus.exe           SQL*Net message to client     300    3003   53 30033 54435000 0000000000000001 00                          0 driver id  #bytes
oracle@testdb (PMON)  pmon timer                    359   33127    2 31362 0000012C 00               00                        717 duration
oracle@testdb (DIAG)  DIAG idle wait                897   49978   11 31394 00000003 0000000000000001 00                        715 component  where   wait time(millisec)
oracle@testdb (DIA0)  DIAG idle wait                  3   31438   18 31411 00000003 0000000000000001 00                        715 component  where   wait time(millisec)
oracle@testdb (CLMN)  pmon timer                    539   36539    3 31366 0000012C 00               00                        717 duration
oracle@testdb (CL01)  library cache: mutex X        186   46807   49 31939 7CE2FAC4 00000C8000000000 000004C50001004F          620 idn        value   where
oracle@testdb (CL00)  pmon timer                    722   15363   34 31937 0000012C 00               00                        620 duration
In some tests, instead of CLnn, we also saw CLMN is waiting for "library cache: mutex X".

P2RAW in Event "library cache: mutex X" is mutex "value", which contains mutex holder (blocker) SID. In this test, it is 200 = 0xC8.

P3RAW in Event "library cache: mutex X" is mutex "where". Its last 8-bit byte indicates the blocking subroutine location.
In this test, it is 79 = 0x4F (kglrfcl1).

We can print out those locations from kglMutexLocations[] array:

#Define Command to PrintkglMutexLocations
define PrintkglMutexLocations
  set $i = 0
  while $i < $arg0 + $arg0
    x /s *(&kglMutexLocations + $i)
    set $i = $i + 2
  end
end
 
(gdb) PrintkglMutexLocations 174
 
  kglpin1    4 (0x04)
  kglhdgn1  62 (0x3E)
  kglrfcl1  79 (0x4F)
  kgllkdl1  85 (0x55)
  kglhdgn2 106 (0x6A)
In a few tests, we saw CLnn having "P3RAW=000004C500010055" on "library cache: mutex X". After about two minutes (see Section DB alert.log), DIAG (DIA0) was active to perform certain actions on Hung Sessions, and CLMN took over "library cache: mutex X" with "P3RAW=000004C50001004F" on "library cache: mutex X". (changed from kgllkdl1 (0x55: library cache lock delete) to kglrfcl1 (0x4F: library cache reference clear))


4. DB alert.log, Dumps and Incident Files


For instance crash, we can also have a look of alert.log, dumps from user and system sessions (CLMN/CLnn, PMON, DIAG/DIAn, MMON/MMON_SLAVE).


4.1 DB alert.log


alert.log showed that KILL SESSION at 15:32:41, DIA0 is active at 15:35:12 (151 seconds later), and finally at 15:37:59 (167 seconds later), errors are signalled and Instance termination was alerted:

    ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ]
    ORA-00601: cleanup lock conflict
   'Instance is being terminated due to fatal process death (pid: 49, ospid: 31939, CL01)'

----------================== alert.log ==================----------
15:32:41.540893
Process termination requested for pid 10011 [source = rdbms], [info = 2] [request issued by pid: 30033, uid: 100]
15:32:41.590770
KILL SESSION for sid=(100, 1001):
  Reason = alter system kill session
  Mode = KILL HARD SAFE -/-/-
  Requestor = USER (orapid = 53, ospid = 30033, inst = 1)
  Owner = Process: USER (orapid = 51, ospid = 10011)
  Result = ORA-0
15:35:12.167943
DIA0 Critical Database Process Blocked: Hang ID 1 blocks 2 sessions
     Final blocker is session ID 200 serial# 2002 OSPID 20022 on Instance 1
     No resolution will be attempted by Hang Manager
15:37:55.685771
Errors in file /orabin/app/oracle/admin/testdb/diag/rdbms/testdb/testdb/trace/testdb_cl01_31939.trc  (incident=89993):
ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ], [], [], [], [], [], [], [], [], [], [], []
ORA-00601: cleanup lock conflict
ORA-00601: cleanup lock conflict
Incident details in: /orabin/app/oracle/admin/testdb/diag/rdbms/testdb/testdb/incident/incdir_89993/testdb_cl01_31939_i89993.trc
15:37:57.240744
Dumping diagnostic data in directory=[cdmp_20220222153757], requested by (instance=1, osid=31939 (CL01)), summary=[incident=89993].
15:37:59.637484
PMON (ospid: 31362): terminating the instance due to ORA error 12752
15:37:59.637658
Cause - 'Instance is being terminated due to fatal process death (pid: 49, ospid: 31939, CL01)'
15:37:59.638404
System state dump requested by (instance=1, osid=31362 (PMON)), summary=[abnormal instance termination].
At 15:35:12.167943, DIA0 detected that 2 sessions were blocked by SID-2 (session ID 200 serial# 2002 OSPID 20022 ). The 2 blocked sessions are SID-1 (orapid = 51, ospid = 10011) and CL01.


4.2 CL01 Dump


cl01.trc looks like:

----------================== testdb_cl01_31939.trc ==================----------
Unix process pid: 31939, image: oracle@testdb (CL01)
*** 15:32:41.593097
*** SESSION ID:(186.46807) 15:32:41.593132
 
KGX cleanup...
KGX Atomic Operation Log 0x9b77d058
Mutex 0xa27c0820(727, 0) idn 7ce2fac4 oper GET_EXCL(5)
Library Cache uid 548 efd 7 whr 4 slp 9271
oper=0 pt1=0xa27c06d0 pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0 uw1=0 uw2=0
msk=0000-0000-0000-0000-0000
 
*** 15:35:11.727606
KGL UOL cleanup...
KGX Atomic Operation Log 0x9c44cf58
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 186 efd 15 whr 77 slp 0
oper=259 pt1=0x9b6a3938 pt2=0x9b6a2800 pt3=(nil)
pt4=0x9b6a3f58 pt5=(nil) ub4=0 flg=0x0 uw1=0 uw2=0
msk=0000-0000-0000-0000-0000
 
LibraryHandle:  Address=0x9b6a3938 Hash=0 LockMode=0 PinMode=0 LoadLockMode=0 Status=INVL Subpool=1
  Name:  Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=0
  Statistics:  InvalidationCount=1 ExecutionCount=9 LoadCount=1 ActiveLocks=0 TotalLockCount=3 TotalPinCount=10
  Counters:  BrokenCount=2 RevocablePointer=2 KeepDependency=0 Version=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
  Concurrency:  DependencyMutex=0x9b6a39e8(0, 0, 0, 0) Mutex=0x9b6a51e0(186, 34, 0, 6)
  Flags=RON/PIN/PN0/EXP/CHD/[10010111] Flags2=[0000]
  WaitersLists: 
    Lock=0x9b6a39c8[0x9b6a39c8,0x9b6a39c8]
    Pin=0x9b6a39a8[0x9b6a39a8,0x9b6a39a8]
    LoadLock=0x9b6a3a20[0x9b6a3a20,0x9b6a3a20]
  LibraryObject:  Address=0x9b6a2800 HeapMask=0000-0000-0000-0000
  NamespaceDump: 
    Child Cursor:  Heap0=0x9b6a28e0 Heap6=0x7f65526469c0 Heap0 Load Time=15:25:46 Heap6 Load Time=15:25:46 15:37:55.686246
Incident 89993 created, dump file: /orabin/app/oracle/admin/testdb/diag/rdbms/testdb/testdb/incident/incdir_89993/testdb_cl01_31939_i89993.trc
 
*** 15:37:55.686302
ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ], [], [], [], [], [], [], [], [], [], [], []
ORA-00601: cleanup lock conflict
It shows "KGX cleanup" on "idn 7ce2fac4" at 15:32:41, and "KGL UOL cleanup" at 15:35:11 (it is close to above DIA0 detection time of 15:35:12).

With following query, we can see "7CE2FAC4" (HASH_VALUE=2095250116) is SYS.DBMS_STANDARD Package Spec, which is also showed as P1RAW for the session event "library cache: mutex X".

select owner, name, namespace, type, hash_value, to_char(hash_value, 'XXXXXXXX') hash_value_HEX
from v$db_object_cache v where name like 'DBMS_STANDARD';
 
  OWNER    NAME                 NAMESPACE            TYPE       HASH_VALUE HASH_VALUE_HEX
  -------- -------------------- -------------------- ---------- ---------- --------------------
  K        DBMS_STANDARD        TABLE/PROCEDURE      CURSOR      599930705  23C23751
  PUBLIC   DBMS_STANDARD        TABLE/PROCEDURE      SYNONYM    2691725148  A070775C
 SYS      DBMS_STANDARD        TABLE/PROCEDURE      PACKAGE    2095250116  7CE2FAC4
  SYS      DBMS_STANDARD        BODY                 CURSOR     3231142607  C09752CF
Text "Mutex=0x9b6a51e0(186, 34, 0, 6)" shows cl01 (SESSION ID: 186) in Execlusuve lock (6).


4.3 CL01 Incident File



----------================== incident/incdir_89993/testdb_cl01_31939_i89993.trc ==================----------
*** SESSION ID:(186.46807) 15:37:55
ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ], [], [], [], [], [], [], [], [], [], [], []
 
  ----- Call Stack Trace -----
  FRAME [11] (kgesoftnmierr()+712 -> kgerinv_internal())
  FRAME [12] (ksepop()+1007 -> kgesoftnmierr())
  FRAME [13] (kgepop()+135 -> ksepop())
  FRAME [14] (kgesecl0()+145 -> kgepop())
  FRAME [15] (kgxWait()+1403 -> kgesecl0())
  FRAME [16] (kgxExclusive()+712 -> kgxWait())
  FRAME [17] (kglGetMutex()+147 -> kgxExclusive())
  FRAME [18] (kglrfcl()+354 -> kglGetMutex())
  FRAME [19] (kglobcl()+589 -> kglrfcl())
  FRAME [20] (kglobfr()+366 -> kglobcl())
  FRAME [21] (kgllccl()+695 -> kglobfr())
  FRAME [22] (kglMutexCleanupAll()+1499 -> kgllccl())
  FRAME [23] (kglOnErrorMutexCleanup()+76 -> kglMutexCleanupAll())
  FRAME [24] (ksepop()+351 -> kglOnErrorMutexCleanup())
  FRAME [25] (kgepop()+135 -> ksepop())
  FRAME [26] (kgepop()+472 -> kgepop())
  FRAME [27] (ksesecl0()+189 -> kgepop())
  FRAME [28] (ksucln_bump_cleanup_timer()+1053 -> ksesecl0())
  FRAME [29] (ksucln_check_wait_tasks()+137 -> ksucln_bump_cleanup_timer())
  FRAME [30] (ksliwat()+3923 -> ksucln_check_wait_tasks())
  FRAME [31] (kslwaitctx()+200 -> ksliwat())
  FRAME [32] (kgxWait()+1291 -> kslwaitctx())
  FRAME [33] (kgxExclusive()+712 -> kgxWait())
  FRAME [34] (kglGetMutex()+147 -> kgxExclusive())
  FRAME [35] (kglrfcl()+354 -> kglGetMutex())
  FRAME [36] (kglobcl()+589 -> kglrfcl())
  FRAME [37] (kglobfr()+366 -> kglobcl())
  FRAME [38] (kglobf0()+327 -> kglobfr())
  FRAME [39] (kgllkdl()+1532 -> kglobf0())
  FRAME [40] (kss_del_cb()+811 -> kgllkdl())
 
    Current Wait Stack:
     1: waiting for 'library cache: mutex X'
        idn=0x7ce2fac4, value=0xc8000000000, where=0x4c50001004f
        wait_id=202 seq_num=203 snap_id=1
        wait times: snap=2 min 44 sec, exc=2 min 44 sec, total=2 min 44 sec
        wait times: max=infinite, heur=2 min 44 sec
        wait counts: calls=16383 os=16383
        in_wait=1 iflags=0x1532
     0: waiting for 'library cache: mutex X'
        idn=0x7ce2fac4, value=0xc8000000000, where=0x4c50001004f
        wait_id=201 seq_num=202 snap_id=1
        wait times: snap=5 min 14 sec, exc=2 min 30 sec, total=5 min 14 sec
        wait times: max=infinite, heur=5 min 14 sec
        wait counts: calls=15000 os=15000
        in_wait=1 iflags=0x15b2
    There is at least one session blocking this session.
      Dumping 1 direct blocker(s):
        inst: 1, sid: 200, ser: 2002
      Dumping final blocker:
        inst: 1, sid: 200, ser: 2002
The error message said:

   "ORA-00700: soft internal error, arguments: [ksepop:1 ksepop recursion ]"
In the Call Stack, we can see two same calls in two different frames:

  FRAME [13] (kgepop()+135 -> ksepop())
  FRAME [25] (kgepop()+135 -> ksepop())
Probably that is interpreted as "recursion" in "[ksepop:1 ksepop recursion ]".

In "Current Wait Stack" section, we can see 'library cache: mutex X' with "idn=0x7ce2fac4", and blocker is SID-1 (sid: 200, ser: 2002).


4.4 SID-1 (blocked session) Dump


SID-1 is waiting for 'library cache: mutex X' with "idn=0x7ce2fac4" and "where=0x4c500010004" (kglpin1).

----------================== testdb_ora_10011_bucket.trc ==================----------
Unix process pid: 32246, image: oracle@testdb
*** SESSION ID:(100.1001) 15:37:57.302316
 
*** 15:37:57.302326
Process diagnostic dump for oracle@testdb, OS id=10011,
-------------------------------------------------------------------------------
current sql: 
client details:
  machine: SYS\M7080 program: sqlplus.exe
Current Wait Stack:
0: waiting for 'library cache: mutex X'
    idn=0x7ce2fac4, value=0xc8000000000, where=0x4c500010004
    wait_id=357 seq_num=359 snap_id=1
    wait times: snap=6 min 48 sec, exc=6 min 48 sec, total=6 min 48 sec
    wait times: max=infinite, heur=6 min 48 sec
    wait counts: calls=9271 os=9271
    in_wait=1 iflags=0x55b2


4.5 SID-2 (blocker session) Dump


There are 2 sessions blocked by this session (SID-2). One waiter is CL01 (sid: 186, ser: 46807), waiting for 'library cache: mutex X' with "idn=0x7ce2fac4" and "where=0x4c50001004f" (kglrfcl1).

----------================== testdb_ora_20022_bucket.trc ==================----------
Unix process pid: 20022, image: oracle@testdb
*** SESSION ID:(200.2002) 15:37:57.303902
*** 15:37:57.303913
Process diagnostic dump for oracle@testdb, OS id=20022,
-------------------------------------------------------------------------------
current sql: select dbms_standard.database_name from dual
client details:
  machine: SYS\M7080 program: sqlplus.exe
Current Wait Stack:
  Not in wait; last wait ended 9 min 3 sec ago
There are 2 sessions blocked by this session.
Dumping one waiter:
  inst: 1, sid: 186, ser: 46807
  wait event: 'library cache: mutex X'
    p1: 'idn'=0x7ce2fac4
    p2: 'value'=0xc8000000000
    p3: 'where'=0x4c50001004f
  row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
  min_blocked_time: 310 secs, waiter_cache_ver: 327
Blog: DBMS_SCHEDULER Job Not Running and Used Slaves has some further discussions on Oracle Kill Statement and UNIX Kill Command.

Wednesday, February 23, 2022

How to determine session's CURRVAL of an oracle sequence without error and without modify ?

Any reference to CURRVAL always returns the last reference to NEXTVAL in the current session. Each session caches its own previous call of NEXTVAL. CURRVAL is session private, whereas NEXTVAL is DB wide global and unique.

If you use CURRVAL without previous call of NEXTVAL, you will get error:
  ORA-08002: sequence SEQ.CURRVAL is not yet defined in this session
If you call NEXTVAL, it will modify the sequence.

So how can we determine session's CURRVAL of a sequence without error and without modify ?

The similar question is asked in:
How to retrieve the current value of an oracle sequence without increment it?

Note: Tested on Oracle 19c in Linux


First we create a sequence:

create sequence test_seq_1;
Then open a new Sqlplus session, make a session heapdump at level 29 (PGA, UGA, CGA, top call heaps, call heaps, session heap):

alter session set tracefile_identifier = 'without_currval';
alter session set events 'immediate trace name heapdump level 29';
In the dump, searching "SEQ CACHE", we found:

  7f0b13d3db18 sz=      504    cprm      "SEQ CACHE      "
Pick this address and make a heapdump_addr:

alter session set tracefile_identifier = 'seq_1';
alter session set events 'immediate trace name heapdump_addr level 2, addr 0x7f0b13d3db18';  
Then search around "addr 0x7f0b13d3db18", we can see the cached sequence: SYS.AUDSES$. There is no "SEQ CACHE" for test_seq_1.

7F0B13D3DB10 00000000 00000000 000001F9 00B38F00  [................]
7F0B13D3DB20 147F8A9C 00000000 1450F9A0 00007F0B  [..........P.....]
7F0B13D3DB30 1450F9A0 00007F0B 55410007 53455344  [..P.......AUDSES]
7F0B13D3DB40 00000024 00000000 00000000 00000000  [$...............]
7F0B13D3DB50 00000000 00000000 00000000 00000000  [................]
        Repeat 5 times
7F0B13D3DBB0 00000000 00000000 00070000 53445541  [............AUDS]
7F0B13D3DBC0 00245345 00000000 00000000 00000000  [ES$.............]
7F0B13D3DBD0 00000000 00000000 00000000 00000000  [................]
        Repeat 5 times
7F0B13D3DC30 00000000 00000000 00000000 59530003  [..............SY]
7F0B13D3DC40 00000053 00000000 00000000 00000000  [S...............]
7F0B13D3DC50 00000000 00000000 00000000 00000000  [................]
        Repeat 7 times
7F0B13D3DCD0 00000169 4EC4000C 00542706 00000000  [i......N.'T.....]
For SYS.AUDSES$, a session connected to an Oracle database may obtain one of its session identifiers, the Auditing Session ID (V$SESSION.AUDSID), by use of the built-in USERENV SQL function
(MOS Docu: How Sessions get Their AUDSID Identifier (Doc ID 122230.1)).

select AUDSID, dump(AUDSID, 16) AUDSID_dump, userenv('SESSIONID'), s.sid from v$session s where sid = sys.dbms_support.mysid;
 
    AUDSID  AUDSID_DUMP                   USERENV('SESSIONID')   SID
  --------  ----------------------------  -------------------- -----
  77053883  Typ=2 Len=5: c4,4e,6,27,54                77053883   189
In address line: 7F0B13D3DCD0, we can see AUDSID=77053883 in internal form (little endian): "4EC4000C 00542706".

Now we call test_seq_1.nextval:

select test_seq_1.nextval, dump(1, 16) dump from dual;
 
   NEXTVAL   DUMP
  -------   ------------------
        1   Typ=2 Len=2: c1,2
Make again a session heapdump:

alter session set tracefile_identifier = 'with_currval';
alter session set events 'immediate trace name heapdump level 29';
There is one entry about "SEQ CACHE" (SYS.AUDSES$ is no more cached, and its cache address is re-used).
The "SEQ CACHE" is located in "session heap".

HEAP DUMP heap name="session heap"  desc=0x7f0b14509568
  7f0b13d3db18 sz=      504    cprm      "SEQ CACHE      "
HEAP DUMP heap name="pga heap"  desc=0x7f0b192fd220
HEAP DUMP heap name="top call heap"  desc=0x7f0b192ffb40
HEAP DUMP heap name="top uga heap"  desc=0x7f0b192ffe00
HEAP DUMP heap name="callheap"  desc=0x7f0b192fe160
HEAP DUMP heap name="callheap"  desc=0x7f0b192fe160
Pick address: 7f0b13d3db18 and make a heapdump_addr:

alter session set tracefile_identifier = 'seq_2';
alter session set events 'immediate trace name heapdump_addr level 2, addr 0x7f0b13d3db18';
 
7F0B13D3DB10 00000000 00000000 000001F9 00B38F00  [................]
7F0B13D3DB20 147F8A9C 00000000 1450F9B0 00007F0B  [..........P.....]
7F0B13D3DB30 1450F9B0 00007F0B 4554000A 535F5453  [..P.......TEST_S]
7F0B13D3DB40 315F5145 00000000 00000000 00000000  [EQ_1............]
7F0B13D3DB50 00000000 00000000 00000000 00000000  [................]
        Repeat 5 times
7F0B13D3DBB0 00000000 00000000 000A0000 54534554  [............TEST]
7F0B13D3DBC0 5145535F 0000315F 00000000 00000000  [_SEQ_1..........]
7F0B13D3DBD0 00000000 00000000 00000000 00000000  [................]
        Repeat 5 times
7F0B13D3DC30 00000000 00000000 00000000 004B0001  [..............K.]
7F0B13D3DC40 00000000 00000000 00000000 00000000  [................]
        Repeat 7 times
7F0B13D3DCC0 96BA7940 00000000 906F3D38 00000000  [@y......8=o.....]
7F0B13D3DCD0 00472B44 02C10009 00542706 00000000  [D+G......'T.....]
In address line: 7F0B13D3DCD0, we can see test_seq_1.nextval=1 in internal form (little endian): "02C1".

From above dumps, we can see that each sequence used in the session has one "SEQ CACHE" for its currval with sz= 504.
The cached currval is stored from offset 444:

select to_number('7F0B13D3DCD0', 'XXXXXXXXXXXX') + 4 - to_number('7F0B13D3DB18', 'XXXXXXXXXXXX') offset from dual;
 
   OFFSET
  -------
      444

kdnnxt and kdncur Subroutines


The call stack for nextval and kdncur shows that internal function kdnnxt and kdncur are called.

--==== select test_seq_1.nextval from dual;
 
#0   kdnnxt ()
#1   qersqPopulate ()
#2   qersqRowProcedure ()
#3   qerfiFetch ()
#4   qersqFetch ()
#5   opifch2 ()
#6   kpoal8 ()
#7   opiodr ()
#8   ttcpip ()
#9   opitsk ()
#10  opiino ()
#11  opiodr ()
#12  opidrv ()
#13  sou2o ()
#14  opimai_real ()
#15  ssthrdmain ()
#16  main ()
 
 
--==== select test_seq_1.currval from dual;                          
                                
#0   kdncur ()
#1   qersqStart ()
#2   selexe0 ()
#3   opiexe ()
#4   kpoal8 ()
#5   opiodr ()
#6   ttcpip ()
#7   opitsk ()
#8   opiino ()
#9   opiodr ()
#10  opidrv ()
#11  sou2o ()
#12  opimai_real ()
#13  ssthrdmain ()
#14  main ()