Monday, September 1, 2025

Blog List

Oracle Database Stuck on DB Writer Wait Event: "oradebug request completion"

Blog:
    Perilous New Undocumented Tracing Feature in Oracle 21c
    oradebug request completion
revealed DBWR wait event "oradebug request completion" and made an in-depth investigation.

This Blog will demonstrate the reproducing of this event and explore further internals.

Note: Tested on Oracle 19.27 with:
   db_writer_processes = 3
   disk_asynch_io      = true
   aio-max-nr          = 1048576   (for 1048576/4096 = 256 Oracle processes in Linux)

1. Test Setup


1.1 Tablespace, table, index


We create a wide table of 100 rows on a slow file system with each data block containing one single row.

-- Create a file system of 10MB (20480 512-bytes sectors) with 5000 ms read/write delay
--   [root@test_db3] > echo "0 20480 delay /dev/loop3 0 5000" | dmsetup create dm-slow3
--   [root@test_db3] > mount /dev/mapper/dm-slow3 /home/oracle/suk/slow_dir3

drop tablespace test_tbs_slow3 including contents and datafiles cascade constraints;

create SMALLFILE tablespace test_tbs_slow3 datafile
  '/home/oracle/suk/slow_dir3/test_tbs_slow3.dbf' size 5M autoextend on next 1M maxsize 9M
logging online extent management local autoallocate blocksize 8k
segment space management auto flashback on;

  -- Elapsed: 00:02:28.63

drop table test_tab_slow3 cascade constraints;

create table test_tab_slow3(id number, text1 varchar2(3510), text2 varchar2(3510))
tablespace test_tbs_slow3 storage (initial 16k next 128k);

create index test_tab_slow3#id on test_tab_slow3(id) tablespace test_tbs_slow3;

insert into test_tab_slow3 select level id, rpad('ABC', 3000, 'X') text1, rpad('ABC', 3000, 'X') text2 from dual connect by level <= 100; 

commit;


1.2 Procedures


Then create job and "flush buffer_cache" procedures:

create or replace procedure update_row3(p_job_id number, p_loop_count number) as
begin
  for i in 1..p_loop_count loop
    update test_tab_slow3 set text1 = rpad('ABC', 3500, 'X')||1, text2 = rpad('ABC', 3500, 'X')||i where id between (p_job_id-1)*10 and p_job_id*10-1;
    --dbms_session.sleep(0.1);
    commit;
    --execute immediate 'alter system checkpoint';
  end loop;
end;
/

-- exec update_row3(1, 1);

create or replace procedure update_row_jobs3(p_job_count number, p_loop_count number) as
begin
  for i in 1..p_job_count loop
    dbms_scheduler.create_job (
      job_name        => 'TEST_JOB3_'||i,
      job_type        => 'PLSQL_BLOCK',
      job_action      => 'begin update_row3('||i||', '||p_loop_count||'); end;',    
      start_date      => systimestamp,
      --repeat_interval => 'systimestamp',
      auto_drop       => true,
      enabled         => true);
  end loop;
end;
/

--exec update_row_jobs3(10, 10000);
--clearup_test 

create or replace procedure flush_bc (p_cnt number) as
begin
  for i in 1..p_cnt loop
    execute immediate 'alter system flush buffer_cache';
  end loop;
end;
/

-- exec flush_bc(1000);


1.3 Slow I/O Test


Run following code with Sql Trace:

alter system flush buffer_cache;
alter session set tracefile_identifier = 'slow_trc_1';		
alter session set events '10046 trace name context forever, level 12';  
select * from test_tab_slow3 where id = 3;
alter session set events '10046 trace name context off';
Here the row trc and TKPROF file. There are two disk reads (one for index, one for table row), each "db file sequential read" takes 5 seconds.

*** 2025-08-28T10:47:16.022188+02:00
WAIT #140226820024472: nam='db file sequential read' ela= 5176667 file#=2607 block#=139 blocks=1 obj#=5938335 tim=8556907488219
*** 2025-08-28T10:47:21.141086+02:00
WAIT #140226820024472: nam='db file sequential read' ela= 5118663 file#=2607 block#=133 blocks=1 obj#=5938334 tim=8556912607124


SQL ID: c6b234hzgd4u9 Plan Hash: 745759405

select * from test_tab_slow3 where id = 3

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00      10.29          2          3          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.00      10.29          2          3          0           1

Rows (1st)  Row Source Operation
----------  ---------------------------------------------------
         1  TABLE ACCESS BY INDEX ROWID BATCHED TEST_TAB_SLOW3 (cr=3 pr=2 pw=0 time=10295880 us starts=1 cost=1 size=337428 card=84)
         1   INDEX RANGE SCAN TEST_TAB_SLOW3#ID (cr=2 pr=1 pw=0 time=5177141 us starts=1 cost=1 size=0 card=34)(object id 5938335)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  db file sequential read                         2        5.17         10.29
  SQL*Net message from client                     2        0.07          0.07
********************************************************************************


2. Test Run


2.1 Test Start


We launch 10 table update JOBs and then "flush buffer_cache" to force DB Writers to actively write data to disk.

declare
  l_job_cnt         number := 10;
  l_running_job_cnt number := 0;
begin
   update_row_jobs3(l_job_cnt, 100000);
   
   while l_running_job_cnt < l_job_cnt loop
     select count(*) into l_running_job_cnt from dba_scheduler_jobs where job_name like '%TEST_JOB%';
     dbms_session.sleep(1);
   end loop;
   dbms_session.sleep(4);
   flush_bc(1000);
end;
/


2.2 DBW0 SIGUSR2 signal Catching


We compose one GDB script and use it to catch SIGUSR2 on DBW0:

------------- gdb_dbw0_script_1.txt -------------
set pagination off
set logging file gdb_dbw0_script_1.log
set logging overwrite on
set logging on
set debug timestamp
set $sig_cnt = 0

# we can also use: catch signal SIGUSR2
break sigprocmask
commands
shell date
printf "===== sigprocmask (%i)=====\n", ++$sig_cnt
bt 8
shell sleep 30
cont
end
cont
Here the output when DBW0 receives SIGUSR2:

$ > gdb -x gdb_dbw0_script_1.txt -p 604093

Breakpoint 1 at 0x7fdfa1303630

Program received signal SIGUSR2, User defined signal 2.
0x0000000001f72ce9 in kcbblwr ()
(gdb) c
Continuing.

Breakpoint 1, 0x00007fdfa1303630 in sigprocmask () from /lib64/libc.so.6
Tue Aug 26 20:52:15 CEST 2025
===== sigprocmask (1)=====
#0  0x00007fdfa1303630 in sigprocmask () from /lib64/libc.so.6
#1  0x00000000035e2900 in sspuser ()
#2  
#3  0x0000000001f72ce9 in kcbblwr ()
#4  0x0000000001f5e5bd in kcbbdrv ()
#5  0x00000000137cdf55 in ksb_act_run_int ()
#6  0x00000000137cce04 in ksb_act_run ()
#7  0x000000000407b4be in ksbabs ()

Breakpoint 1, 0x00007fdfa1303630 in sigprocmask () from /lib64/libc.so.6
Tue Aug 26 20:52:45 CEST 2025
===== sigprocmask (2)=====
#0  0x00007fdfa1303630 in sigprocmask () from /lib64/libc.so.6
#1  0x00000000035e296d in sspuser ()
#2  
#3  0x0000000001f72ce9 in kcbblwr ()
#4  0x0000000001f5e5bd in kcbbdrv ()
#5  0x00000000137cdf55 in ksb_act_run_int ()
#6  0x00000000137cce04 in ksb_act_run ()
#7  0x000000000407b4be in ksbabs ()


2.3 DB alert.log Watching


When DBW0 (session ID 810 serial# 24481 OSPID 604093) receives SIGUSR2, DB hangs and becomes unresponsive with DBW0 as Final blocker.
Other DB writers (DBW1 and DBW2) shows "executing after 5 secs due to non-DBWR task".

$ > tail -f alert_test_db3.log

2025-08-26T20:38:35.835071+02:00
DIA0 Critical Database Process As Root: Hang ID 9 blocks 10 sessions
     Final blocker is session ID 810 serial# 24481 OSPID 604093 on Instance 1
     If resolvable, instance eviction will be attempted by Hang Manager
2025-08-26T20:38:40.848025+02:00
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dia0_604091_base_1.trc
2025-08-26T20:39:23.354993+02:00
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dia0_604091_lws_2.trc
2025-08-26T20:39:45.434724+02:00
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dia0_604091_lws_2.trc
2025-08-26T20:40:24.557175+02:00
System state dump requested by (instance=1, osid=604097 (DBW1)), summary=[system state dump request (ksudss)].
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_diag_604074_20250826204024.trc
2025-08-26T20:40:24.561550+02:00
DBW 1 executing after 5 secs due to non-DBWR task at ts=1756233624
2025-08-26T20:40:33.925273+02:00
DBW 2 executing after 5 secs due to non-DBWR task at ts=1756233633
2025-08-26T20:46:13.941192+02:00


2.4 bpftrace tracing tgkill (sender, receiver and signal)


We trace tgkill (sender, receiver, signal) with following script.

A comment like "# DIA0 -> DBW0" is added in the output to denote "DIA0 sends SIGUSR2 (sig: 12) to DBW0".
(In the output, we only show one full callstack of first event).

"# DIA0 -> DBW0" is triggered by "oradebug short_stack" from DIA0 on DBW0.
If DBW0 is slow or there are a huge write, DIA0 session shows Wait Event: "oradebug request completion".
If it takes more than 5 seconds, v$system_event.total_timeouts increased 1.

We can see different pairs of (sender, receiver) like "DBW1 -> CKPT", "DBW0 -> DBW2".
It looks like some automatic diagnostic activities among Oracle processes.

$> bpftrace -e 'BEGIN {@seq = 0}
uprobe:/orabin/app/oracle/product/19.27.0.0.250415-212/bin/oracle:sskgp_tgkill+2
{
  time("%H:%M:%S --- ");
  printf("%d --- comm=%s pid=%d rsi(receiver_pid)=%d rdx(sig)=%d \n", @seq++, comm, pid, reg("si"), reg("dx"));
  printf("%s\n", ustack())
}'
Attaching 2 probes...

20:37:22 --- 65 --- comm=ora_dia0_test_d pid=604091 rsi(receiver_pid)=604093 rdx(sig)=12      # DIA0 -> DBW0
        sskgp_tgkill+2
        ksdx_cmdreq_issue+644
        ksdx_cmdreq_reissue_attempt+433
        ksdx_cmdreq_exec_track_debugee+517
        ksdx_cmdreq_exec+173
        ksdxdocmdmultex+1768
        ksdxdocmdmult+55
        ksudmp_proc_short_stack+373
        kjznlwshngsew+2725
        kjznlwstrigdmps+268
        kjznlwschksess+1303
        kjznrmnscanrmnsess+11216
        kjznrmnhmdetect+283
        kjznrmnhm+27
        kjznhm+1892
        kjzdregf+906
        kjzdm+691
        ksbrdp+1167
        opirip+541
        opidrv+581
        sou2o+165
        opimai_real+173
        ssthrdmain+417
        main+256
        __libc_start_main+243
        0x5541f689495641d7
        
20:37:37 --- 66 --- comm=ora_dia0_test_d pid=604091 rsi(receiver_pid)=620881 rdx(sig)=12      # DIA0 -> J008
20:40:19 --- 74 --- comm=ora_dbw1_test_d pid=604097 rsi(receiver_pid)=604109 rdx(sig)=12      # DBW1 -> CKPT
20:40:55 --- 94 --- comm=ora_diag_test_d pid=604074 rsi(receiver_pid)=604097 rdx(sig)=12      # DIAG -> DBW1
...
20:51:53 --- 167 --- comm=ora_mmon_test_d pid=604139 rsi(receiver_pid)=604043 rdx(sig)=12     # MMON -> PMON
...
20:53:09 --- 184 --- comm=ora_dia0_test_d pid=604091 rsi(receiver_pid)=604139 rdx(sig)=12     # DIA0 -> MMON
20:53:29 --- 186 --- comm=ora_dbw0_test_d pid=604093 rsi(receiver_pid)=604097 rdx(sig)=12     # DBW0 -> DBW1     
20:53:30 --- 187 --- comm=ora_dbw0_test_d pid=604093 rsi(receiver_pid)=604101 rdx(sig)=12     # DBW0 -> DBW2
20:53:31 --- 202 --- comm=ora_diag_test_d pid=604074 rsi(receiver_pid)=604091 rdx(sig)=12     # DIAG -> DIA0


2.5 bpftrace on DB Writers


We use following script to trace DBWR signal catching (only full callstack are showed for first two events).
It shows DBWR receive signals.

[root@test_db3] >bpftrace -e 'BEGIN {@seq = 0}
tracepoint:syscalls:sys_enter_rt_sigprocmask /strncmp("ora_dbw", comm, 7) == 0/
{
  time("%H:%M:%S --- ");
  printf("%d --- comm=%s pid=%d \n", @seq++, comm, pid);
  printf("%s\n", ustack())
}'
Attaching 2 probes...

19:53:53 --- 0 --- comm=ora_dbw1_test_d pid=604097
        __GI___sigprocmask+17
        0x7f80476f1c20
        skgpwwait+187
        ksliwat+2293
        kslwaitctx+200
        ksarcv+376
        ksbabs+670
        ksbrdp+1167
        opirip+541
        opidrv+581
        sou2o+165
        opimai_real+173
        ssthrdmain+417
        main+256
        __libc_start_main+243
        0x5541f689495641d7
19:53:53 --- 1 --- comm=ora_dbw1_test_d pid=604097
        pthread_sigmask+74
        ksdxcb+499
        sspuser+223
        0x7f80476f1c20
        skgpwwait+187
        ksliwat+2293
        kslwaitctx+200
        ksarcv+376
        ksbabs+670
        ksbrdp+1167
        opirip+541
        opidrv+581
        sou2o+165
        opimai_real+173
        ssthrdmain+417
        main+256
        __libc_start_main+243
        0x5541f689495641d7
20:40:24 --- 3 --- comm=ora_dbw2_test_d pid=604101
        __GI___sigprocmask+17
20:40:30 --- 4 --- comm=ora_dbw2_test_d pid=604101
        pthread_sigmask+74
20:52:45 --- 15 --- comm=ora_dbw0_test_d pid=604093
        __GI___sigprocmask+17
20:52:50 --- 16 --- comm=ora_dbw0_test_d pid=604093
        pthread_sigmask+74


2.6 wait_event "oradebug request completion" systemstate dump


During test, we also set event to generate a system-state trace when a wait of bigger than 10ms on 'oradebug request completion' is detected:

ALTER SYSTEM SET events = 'wait_event["oradebug request completion"]{wait: minwait=10000} systemstate(266)';
--ALTER SYSTEM SET events = 'wait_event["oradebug request completion"]{wait: minwait=10000} off';


SQL> oradebug setmypid
Statement processed.
SQL> oradebug eventdump system
wait_event ["oradebug request completion"] {wait: minwait=10000} systemstate(266)
In DB alert.log, we can see the line:

System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_diag_604074_20250826204024.trc
Open "test_db3_diag_604074_20250826204024.trc", it shows DB is blocked by DBW0 (sid 810) and CKPT (sid: 811):

--------------- test_db3_diag_604074_20250826204024.trc ---------------

Unix process pid: 604074, image: oracle@test_db3 (DIAG)
*** 2025-08-26T20:40:24.557417+02:00

System state caller stack:
Short stack dump:
ksedsts()+426<-ksdxfstk()+63<-ksdxdocmdmultex()+5627<-ksdxdocmdmult()+55<-ksudmp_proc_short_stack()+373<-ksudss_opt()+307<-kjzddss()+545<-kjzdmdrq()+6930<-kjzdregf()+4183<-kjzdm()+
691<-ksbrdp()+1167<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+243<-_start()+46
===================================================
SYSTEM STATE (level=10, with short stacks)
------------
Dumping the Process Summary

       8: USER ospid 619494 sid 26 ser 53200, waiting for 'DLM cross inst call completion'
          Cmd: PL/SQL EXECUTE
          Blocked by inst: 1, sid: 811, ser: 4896
          Final Blocker inst: 1, sid: 811, ser: 4896
      
      11: DIAG ospid 604074 sid 808 ser 30814,

      18: DIA0 ospid 604091 sid 541 ser 16758, waiting for 'DIAG idle wait'
      19: DBW0 ospid 604093 sid 810 ser 24481,
      20: DBW1 ospid 604097 sid 5 ser 1428, waiting for 'oradebug request completion'
      21: DBW2 ospid 604101 sid 273 ser 22218, waiting for 'rdbms ipc message'
      22: LGWR ospid 604105 sid 542 ser 2191, waiting for 'rdbms ipc message'
      23: CKPT ospid 604109 sid 811 ser 4896, waiting for 'rdbms ipc message'
      24: LG00 ospid 604113 sid 6 ser 7534, waiting for 'LGWR worker group idle'
      25: SMON ospid 604115 sid 274 ser 52613, waiting for 'enq: CR - block range reuse ckpt'
          Blocked by inst: 1, sid: 811, ser: 4896
          Final Blocker inst: 1, sid: 811, ser: 4896
      26: LG01 ospid 604119 sid 543 ser 61931, waiting for 'LGWR worker group idle'
      
      56: J000 ospid 620865 sid 21 ser 42342, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
      57: J001 ospid 620867 sid 292 ser 4060, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
      58: J002 ospid 620869 sid 560 ser 27312, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481

      66: J00A ospid 620994 sid 561 ser 35509, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
      68: M002 ospid 618975 sid 22 ser 30742, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
      69: MZ00 ospid 620998 sid 295 ser 41723, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPSERT
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
      70: M004 ospid 618953 sid 557 ser 23163, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPSERT
          Blocked by inst: 1, sid: 810, ser: 24481
          Final Blocker inst: 1, sid: 810, ser: 24481
In another test, we got following DB alert.log:

2025-08-25T10:50:32.827370+02:00
DIA0 Critical Database Process As Root: Hang ID 1 blocks 11 sessions
     Final blocker is session ID 810 serial# 31793 OSPID 601523 on Instance 1
     If resolvable, instance eviction will be attempted by Hang Manager
2025-08-25T10:50:37.839966+02:00
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dia0_601521_base_1.trc
2025-08-25T10:50:39.601999+02:00
Errors in file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dia0_601521_base_1.trc:
ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 601523, image: oracle@test_db3 (DBW0)'; prior command execution time exceeds 5000 ms
2025-08-25T10:50:44.283104+02:00
DIA0 Critical Database Process As Root: Hang ID 2 blocks 3 sessions
     Final blocker is session ID 811 serial# 20147 OSPID 601539 on Instance 1
     If resolvable, instance eviction will be attempted by Hang Manager
2025-08-25T10:52:10.789009+02:00
System state dump requested by (instance=1, osid=601531 (DBW2)), summary=[system state dump request (ksudss)].
2025-08-25T10:52:10.795339+02:00
Errors in file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dbw2_601531.trc:
ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 601523, image: oracle@test_db3 (DBW0)'; prior command execution time exceeds 5000 ms
2025-08-25T10:52:10.805672+02:00
Errors in file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_dbw1_601527.trc:
ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 601523, image: oracle@test_db3 (DBW0)'; prior command execution time exceeds 5000 ms
System State dumped to trace file /orabin/app/oracle/admin/test_db3/diag/rdbms/test_db3/test_db3/trace/test_db3_diag_601504_20250825105210.trc
2025-08-25T10:52:11.546767+02:00
DBW 1 executing after 6 secs due to non-DBWR task at ts=1756111931
2025-08-25T10:52:11.646360+02:00
DBW 2 executing after 6 secs due to non-DBWR task at ts=1756111931
It shows "ORADEBUG command 'SHORT_STACK' execution time exceeds 5000 ms":

ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 601523, image: oracle@test_db3 (DBW0)'; prior command execution time exceeds 5000 ms
"5000 ms" seems hard-coded, if we look kjznlwshngsew (caller of sskgp_tgkill), we can see (5000 = 0x1388):

   0x0000000003357d4b <+2715>:	mov    $0x1388,%edx
   0x0000000003357d50 <+2720>:	callq  0x13570f0 
Open "test_db3_diag_601504_20250825105210.trc", the systemstate dump shows DB blocked by DBW0 (sid 810):

----------- test_db3_diag_601504_20250825105210.trc -----------

      19: DBW0 ospid 601523 sid 810 ser 31793, waiting for 'db file parallel write'
      20: DBW1 ospid 601527 sid 5 ser 51041,
      21: DBW2 ospid 601531 sid 273 ser 53811,
      22: LGWR ospid 601535 sid 542 ser 33370, waiting for 'rdbms ipc message'
      23: CKPT ospid 601539 sid 811 ser 20147, waiting for 'rdbms ipc message'

      34: MMON ospid 601569 sid 545 ser 42846, waiting for 'log file switch (checkpoint incomplete)'
          Blocked by inst: 1, sid: 810, ser: 31793
          Final Blocker inst: 1, sid: 810, ser: 31793

      38: USER ospid 601876 sid 554 ser 30768, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 31793
          Final Blocker inst: 1, sid: 810, ser: 31793
      
      47: W003 ospid 601811 sid 816 ser 37737, waiting for 'log file switch (checkpoint incomplete)'
          Blocked by inst: 1, sid: 810, ser: 31793
          Final Blocker inst: 1, sid: 810, ser: 31793

      57: USER ospid 601960 sid 290 ser 217, waiting for 'DLM cross inst call completion'
          Cmd: PL/SQL EXECUTE
          Blocked by inst: 1, sid: 811, ser: 20147
          Final Blocker inst: 1, sid: 811, ser: 20147
      58: M002 ospid 601840 sid 555 ser 44546, waiting for 'class slave wait'
      59: J002 ospid 601942 sid 824 ser 55525, waiting for 'log file switch (checkpoint incomplete)'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 810, ser: 31793
          Final Blocker inst: 1, sid: 810, ser: 31793
      60: J000 ospid 601938 sid 20 ser 28041, waiting for 'enq: US - contention'
          Cmd: UPDATE
          Blocked by inst: 1, sid: 826, ser: 1291
          Final Blocker inst: 1, sid: 810, ser: 31793


2.7 Session 'oradebug request completion' Stats



select localtimestamp, event, total_waits, total_timeouts, round(time_waited_micro/total_waits) ms_per_wait, round(time_waited_micro/1e6) wait_seconds --,  v.* 
from v$system_event v where lower(event) = 'oradebug request completion';

LOCALTIMESTAMP         EVENT                        TOTAL_WAITS TOTAL_TIMEOUTS MS_PER_WAIT WAIT_SECONDS
---------------------- ---------------------------- ----------- -------------- ----------- ------------
27-AUG-2025 12:14:45   oradebug request completion         2507             67     1006642         2524

select sample_time, program, session_id, session_serial#, event 
from v$active_session_history t 
where sample_time between timestamp'2025-8-26 20:38:40' and timestamp'2025-8-26 20:40:35'
  and event in ('oradebug request completion')
order by t.sample_time;

SAMPLE_TIME             PROGRAM                  SESSION_ID SESSION_SERIAL# EVENT
----------------------- ------------------------ ---------- --------------- ---------------------------
26-AUG-2025 20:38:40    oracle@test_db3 (DIA0)          541           16758 oradebug request completion
26-AUG-2025 20:40:19    oracle@test_db3 (DBW1)            5            1428 oradebug request completion
26-AUG-2025 20:40:20    oracle@test_db3 (DBW1)            5            1428 oradebug request completion
26-AUG-2025 20:40:21    oracle@test_db3 (DBW1)            5            1428 oradebug request completion
26-AUG-2025 20:40:22    oracle@test_db3 (DBW1)            5            1428 oradebug request completion
26-AUG-2025 20:40:23    oracle@test_db3 (DBW1)            5            1428 oradebug request completion
26-AUG-2025 20:40:24    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:25    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:26    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:27    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:28    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:29    oracle@test_db3 (DBW2)          273           22218 oradebug request completion
26-AUG-2025 20:40:29    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:30    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:30    oracle@test_db3 (DBW2)          273           22218 oradebug request completion
26-AUG-2025 20:40:31    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:31    oracle@test_db3 (DBW2)          273           22218 oradebug request completion
26-AUG-2025 20:40:32    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:32    oracle@test_db3 (DBW2)          273           22218 oradebug request completion
26-AUG-2025 20:40:33    oracle@test_db3 (DBW2)          273           22218 oradebug request completion
26-AUG-2025 20:40:33    oracle@test_db3 (DIAG)          808           30814 oradebug request completion
26-AUG-2025 20:40:34    oracle@test_db3 (DIAG)          808           30814 oradebug request completion


2.8 Session Process Info


Here the session info ( SID, SERIAL#, PID, SPID) in our test:

select s.program, event, s.sid, s.serial#, pid, spid  
  from v$session s, v$process p 
 where s.paddr=p.addr and (lower(s.program) like '%(dia%' or lower(s.program) like '%(dbw%' 
       or lower(s.program) like '%(ckpt%' or lower(s.program) like '%(lg%' or lower(s.program) like '%(j0%');

PROGRAM                   EVENT                        SID  SERIAL#   PID SPID
------------------------- -------------------------- ----- -------- ----- ------
oracle@test_db3 (DIAG)    DIAG idle wait               808    30814    11 604074
oracle@test_db3 (DIA0)    DIAG idle wait               541    16758    18 604091
oracle@test_db3 (DBW0)    db file parallel write       810    24481    19 604093
oracle@test_db3 (DBW1)    db file parallel write         5     1428    20 604097
oracle@test_db3 (DBW2)    db file parallel write       273    22218    21 604101
oracle@test_db3 (LGWR)    rdbms ipc message            542     2191    22 604105
oracle@test_db3 (CKPT)    rdbms ipc message            811     4896    23 604109
oracle@test_db3 (LG00)    LGWR worker group idle         6     7534    24 604113
oracle@test_db3 (LG01)    LGWR worker group idle       543    61931    26 604119
oracle@test_db3 (J000)    db file sequential read       21    42342    56 620865
oracle@test_db3 (J001)    free buffer waits            292     4060    57 620867
oracle@test_db3 (J002)    db file sequential read      560    27312    58 620869
oracle@test_db3 (J003)    free buffer waits            829    13992    59 620871
oracle@test_db3 (J004)    db file sequential read       20    48258    60 620873
oracle@test_db3 (J005)    db file sequential read      289     7069    61 620875
oracle@test_db3 (J006)    db file sequential read      562    16826    62 620877
oracle@test_db3 (J007)    db file sequential read      830    19629    63 620879
oracle@test_db3 (J008)    db file sequential read        2    61229    64 620881
oracle@test_db3 (J009)    db file sequential read      291     8814    65 620883
oracle@test_db3 (J00A)    jobq slave wait              561    24036    66 620885


3. Observing "oradebug request completion"


With following small test, we can watch "oradebug request completion". Open two Sqlplus sessions (PID 36 and 59):

select s.last_call_et, s.program, event, s.sid, s.serial#, pid, spid  
  from v$session s, v$process p 
 where s.paddr=p.addr and lower(s.program) like '%sqlplus%' order by 1;
 
LAST_CALL_ET PROGRAM                        EVENT                         SID  SERIAL#  PID SPID
------------ ------------------------------ ---------------------------- ---- -------- ---- ------
          55 sqlplus.exe                    SQL*Net message from client    19    50486   36 613512
         639 sqlplus@test_db3 (TNS V1-V3)   SQL*Net message from client   829    46358   59 623458
Start GDB to catch signal SIGUSR2 (or "break sigprocmask") on PID: 36 (SPID 613512)

$> gdb -p 613512

(gdb) catch signal SIGUSR2
Catchpoint 1 (signal SIGUSR2)
(gdb) handle SIGUSR2
Signal        Stop      Print   Pass to program Description
SIGUSR2       Yes       Yes     Yes             User defined signal 2
(gdb) c
From PID 59, we run "oradebug short_stack" on PID 36:
  
SQL> oradebug setorapid 36
Oracle pid: 36, Unix process pid: 613512, image: oracle@test_db3
SQL> oradebug short_stack
GDB shows PID: 36 Catchpoint 1 (signal SIGUSR2) reached:

(gdb) catch signal SIGUSR2
Catchpoint 1 (signal SIGUSR2)
(gdb) handle SIGUSR2
Signal        Stop      Print   Pass to program Description
SIGUSR2       Yes       Yes     Yes             User defined signal 2
(gdb) c
Continuing.

Catchpoint 1 (signal SIGUSR2), 0x00007f53174d79b2 in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f53174d79b2 in read () from /lib64/libpthread.so.0
#1  0x0000000013a74cb4 in read ()
#2  0x0000000013a70950 in snttread ()
#3  0x00000000065a25f5 in nttrd ()
#4  0x00000000064afe40 in nsprecv ()
#5  0x00000000064b8632 in nsrdr ()
#6  0x0000000012bcf1d5 in nsfull_pkt_rcv ()
#7  0x0000000012bd688e in nsfull_brc ()
#8  0x0000000013a5f886 in nioqrc ()
#9  0x00000000135c893a in opikndf2 ()
#10 0x00000000029739c2 in opitsk ()
#11 0x0000000002979ccb in opiino ()
#12 0x00000000135cb7a0 in opiodr ()
#13 0x000000000296fa86 in opidrv ()
#14 0x00000000035da375 in sou2o ()
#15 0x0000000000e0b566 in opimai_real ()
#16 0x00000000035e7011 in ssthrdmain ()
#17 0x0000000000e0b390 in main ()
Now if we look Session (PID 59), it is waiting on "oradebug request completion" because it sends oradebug request on PID 36,
which is pending to continuation, hence PID 59 waiting for completion.

select s.last_call_et, s.program, event, s.sid, s.serial#, pid, spid  
  from v$session s, v$process p 
 where s.paddr=p.addr and lower(s.program) like '%sqlplus%' order by 1;

LAST_CALL_ET PROGRAM                        EVENT                           SID   SERIAL#  PID SPID
------------ ------------------------------ ---------------------------- ------ --------- ---- ------
         191 sqlplus.exe                    SQL*Net message from client      19     50486   36 613512
         775 sqlplus@test_db3 (TNS V1-V3)   oradebug request completion     829     46358   59 623458
By the way, we also observe "oradebug request completion" in Oracle 19.9.


4. strace on DBW0


With strace, we can watch asynchronous I/O (io_submit and io_getevents):

$> strace -t -p 604093

11:29:40 io_submit(140598410809344, 5, 
            [{data=0x7fdf9f22a738, pwrite, fildes=261, str="\2\242\0\0Z\2\300\0\202\377\216\0355\f\1\4\334\225\0\0\f\0\33\0| \7\0\35E\16\16"..., nbytes=40960, offset=4931584}, 
             {data=0x7fdf9ef17108, pwrite, fildes=261, str="\2\242\0\0qt\301\0\210\377\216\0355\f\2\4I\335\0\0\n\0!\0M\306\n\0`\343\2\2"..., nbytes=40960, offset=781066240}, 
             {data=0x7fdf9eed1758, pwrite, fildes=261, str="\2\242\0\0)\340\323\0J\377\216\0355\f\6\4^\\\0\0\4\0\35\0\210n\10\0\225\210\10\10"..., nbytes=8192, offset=10670645248}, 
             {data=0x7fdf9ee2ae38, pwrite, fildes=261, str="\2\242\0\0I\273\336\0k\377\216\0355\f\1\4\254|\0\0\t\0\27\0\37\373\7\0\272\204\2\2"..., nbytes=8192, offset=16498892800}, 
             {data=0x7fdf9eea4658, pwrite, fildes=2865, str="\6\242\0\0\221\0@\214\227\377\216\0355\f\2\6o$\0\0\1\0\0\0\236\234Z\0\226\377\216\35"..., nbytes=32768, offset=1187840}]) = 5
11:29:40 io_getevents(140598410809344, 1, 128, 
            [{data=0x7fdf9ee2ae38, obj=0x7fdf9ee2ae38, res=8192, res2=0}, 
             {data=0x7fdf9eed1758, obj=0x7fdf9eed1758, res=8192, res2=0}, 
             {data=0x7fdf9ef17108, obj=0x7fdf9ef17108, res=40960, res2=0}, 
             {data=0x7fdf9f22a738, obj=0x7fdf9f22a738, res=40960, res2=0}], 
             {600, 0}) = 4
11:29:40 io_getevents(140598410809344, 1, 128, 
            [{data=0x7fdf9eea4658, obj=0x7fdf9eea4658, res=32768, res2=0}], 
             {600, 0}) = 1
          
$ > ./lsof -p 604093

  ora_dbw0_ 604093 oracle  261uW  REG     8,1 34359730176  688992556 /oratest_db3/oradata/test_db3/undo01.dbf
  ora_dbw0_ 604093 oracle 2865uW  REG   253,2     5251072         12 /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf
Above io_submit sends 5 pwrite requests (4 on UNOD file (fildes=261), one on tablespace file (fildes=2865) of table test_tab_slow3),
and calls twice io_getevents to read asynchronous I/O events.
(io_submit and io_getevents are matched by "data" and "res" fields. Their order can be different due to asynchronous)

With asynchronous I/O, DBWR calls io_submit to queue I/O requests, kworkers handle actual I/O requests, once I/O operation finishes,
DBWR is signalled and read asynchronous I/O events from the completion queue by io_getevents.
Hence DBWR seems idle even there are heavy I/O activities.

Often "oradebug request completion" is cause by unusually high storage latency or hardware issues, indicating a performance bottleneck.
In AWR reports, we can see those outliers (Av Rd/ Av Writes) of "db file async I/O submit", "db file parallel write" and IOStat sections
(by function, filetype, tablespace,file).

If Linux AIO-MAX-NR is set too (AIO-MAX-NR=131072 AIO-NR=13104), following WARNING wrote into DB Writer trc and DB Performance deteriorated sluggishly. (If currently active AIO number: aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN):

WARNING:1 Oracle process running out of OS kernelI/O resources request_aiolimit=4096 set_aiolimit=0
 
WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=131072 AIO-NR=13104
We can also monitor them by querying v$io_outlier:

select function_name, file_name, disk1_name, count(*) cnt, sum(io_latency) io_latency_SUM, round(avg(io_latency)) io_latency_AVG
  from v$io_outlier where disk1_name = 'dm-3' group by function_name, file_name, disk1_name order by file_name, disk1_name;

FUNCTION_NAME        FILE_NAME                                          DISK1_NAME        CNT IO_LATENCY_SUM IO_LATENCY_AVG
-------------------- -------------------------------------------------- ---------- ---------- -------------- --------------
Buffer Cache Reads   /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf      dm-3              156         816345           5233
DBWR                 /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf      dm-3               94        1127662          11996
Direct Reads         /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf      dm-3                4          25645           6411
Direct Writes        /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf      dm-3                1          10239          10239
Others               /home/oracle/suk/slow_dir3/test_tbs_slow3.dbf      dm-3                4          21308           5327
We can get full aio_write call stack by:

$> perf record -e 'workqueue:*' -ag -T

$> perf report

-- Type "Enter" on "19K workqueue:workqueue_queue_work", type "/ Filter symbol by name", then give "io_submit_one"

Samples: 19K of event 'workqueue:workqueue_queue_work', Event count (approx.): 19945
  Children      Self  Command          Shared Object      Symbol
-    5.85%     0.00%  ora_dbw0_test_d  [kernel.kallsyms]  [k] io_submit_one                           
   - io_submit_one                                                                                    
      - 5.77% aio_write                                                                               
         - 5.64% xfs_file_write_iter                                                                  
            - 5.63% xfs_file_dio_write_aligned                                                        
               - 5.37% iomap_dio_rw                                                                   
                  - __iomap_dio_rw                                                                    
                     - 5.07% iomap_apply                                                              
                        - 4.91% iomap_dio_bio_actor                                                   
                           - 4.37% iomap_dio_submit_bio.isra.23                                       
                              - 4.33% submit_bio                                                      
                                 - generic_make_request                                               
                                    - 4.16% blk_mq_make_request                                       
                                       - 2.76% blk_flush_plug_list                                    
                                          - blk_mq_flush_plug_list                                    
                                             - 2.76% blk_mq_sched_insert_requests                     
                                                - 2.70% __blk_mq_delay_run_hw_queue                   
                                                   - 2.70% __blk_mq_run_hw_queue                      
                                                        blk_mq_sched_dispatch_requests                
                                                        __blk_mq_sched_dispatch_requests              
                                                      - __blk_mq_do_dispatch_sched                    
                                                         - 2.57% blk_mq_dispatch_rq_list              
                                                            - 2.56% scsi_queue_rq                     
                                                               - 2.32% pvscsi_queue                   
                                                                    _raw_spin_unlock_irqrestore       
                                                                    ret_from_intr                     
                                                                    do_IRQ                            
                                                                    irq_exit                          
                                                                    irq_exit_rcu                      
                                                                    __softirqentry_text_start         
                                                                    blk_complete_reqs                 
                                                                    scsi_io_completion                
                                                                  - scsi_end_request                  
                                                                     - 2.31% blk_update_request       
                                                                          iomap_dio_bio_end_io        
                                                                          queue_work_on               
                                                                          __queue_work                
                                                                          __queue_work                
                                       - 1.11% __blk_mq_alloc_request                                 
                                          - 1.11% blk_mq_get_tag                                      
                                             - 1.07% io_schedule                                      
                                                  io_schedule_prepare                                 
                                                  blk_flush_plug_list                                 
                                                  blk_mq_flush_plug_list                              
                                                  blk_mq_sched_insert_requests                        
                                                  kblockd_mod_delayed_work_on                         
                                                - mod_delayed_work_on                                 
                                                   - 1.07% __queue_work                               
                                                        __queue_work                                  
With following bpftrace script, we can also get similar info as strace:

bpftrace -e 'tracepoint:syscalls:sys_enter_io_submit /pid==604093/ {
              $iocb_sizeof = sizeof(struct iocb);
              $reqs = args->nr;
              $iocbpp_ptr = (struct iocb **)args->iocbpp;
              time("%H:%M:%S --- ");
              printf("%s, iocb_sizeof: %d, io_submit I/O requests: %d\n", comm, $iocb_sizeof, $reqs);
              $i = 1; while ($i <= $reqs) { $iocbpp_ptr_2 = (struct iocb **)($iocbpp_ptr + $iocb_sizeof*($i-1));
                                            $iocbpp_struct = (struct iocb *)(*$iocbpp_ptr_2);
                                            printf("--- File: %d --- ptr_2 %p, aio_fildes: %d, aio_nbytes: %d, aio_offset: %ld, aio_lio_opcode(1=PWRITE): %d, aio_data:0X%X\n",
                                                   $i, $iocbpp_ptr_2, $iocbpp_struct->aio_fildes, $iocbpp_struct->aio_nbytes, $iocbpp_struct->aio_offset, $iocbpp_struct->aio_lio_opcode, $iocbpp_struct->aio_data);
                                            @All_CNT = count(); @CNT[$iocbpp_struct->aio_fildes] = count(); @SUM[$iocbpp_struct->aio_fildes] = sum( $iocbpp_struct->aio_nbytes);
                                            if ($i > 50) { break; }
                                            $i++ }
              printf("\n");}
              tracepoint:syscalls:sys_exit_io_getevents /pid==604093/ {time("%H:%M:%S --- "); printf("-- 1 -- io_getevents Ret: %d\n", args->ret); @Ret_CNT = sum(args->ret);
              }'


5. ORA-32515 - ORA-32519


Above DB alert.log shows ORA-32515. Here some ORADEBUG related errors:

oracle@test_db3:test_db3 > oerr ora 32515
32515, 00000, "cannot issue ORADEBUG command '%s' to process '%s'; prior command execution time exceeds %s ms"
// *Cause:  The process targeted to execute the ORADEBUG command was
//          busy executing another ORADEBUG command for a time greater
//          than the timeout value specified.
// *Action: Increase the timeout value.

oracle@test_db3:test_db3 > oerr ora 32516
32516, 00000, "cannot wait for process '%s' to finish executing ORADEBUG command '%s'; wait time exceeds %s ms"
// *Cause:  The execution of the ORADEBUG command took longer than the
//          the timeout value specified.
// *Action: Increase the timeout value.

oracle@test_db3:test_db3 > oerr ora 32517
32517, 00000, "cannot issue ORADEBUG command '%s' to process '%s' (prior command execution time exceeds %s ms);  total wait time exceeds %s ms"
// *Cause:  The process targeted to execute the ORADEBUG command was
//          busy executing another ORADEBUG command. But, because
//          the total wait time for all targeted processes
//          exceeded the maximum wait time, the ORADEBUG command was
//          not issued to the target process.
// *Action: Increase the timeout value.

oracle@test_db3:test_db3 > oerr ora 32518
32518, 00000, "cannot wait for process '%s' to finish executing ORADEBUG command '%s' (waited %s ms); total wait time exceeds %s ms"
// *Cause:  The total wait time for all targeted processes exceeded the
//          maximum wait time, therefore the wait for the targeted process
//          to finish executing the ORADEBUG command was aborted.
// *Action: Increase the timeout value.

oracle@test_db3:test_db3 > oerr ora 32519
32519, 00000, "insufficient privileges to execute ORADEBUG command: %s"
// *Cause: Sufficient privileges were not granted.
// *Action: Perform the operation using a database session that
//          has the required privileges. See the Oracle ORADEBUG
//          documentation for more details.
ORADEBUG probably has some changes, for example,
Oracle 19.19 Patch:
     Patch 25293381: RUNNING "ORADEBUG SETTRACEFILEID TEST" CAUSES ORA-32522: [KGHALO BAD HEAP DS]

And we also noticed that from Oracle 19.26, DBW0 generates a trc file of more than 1 MB.