A few days ago, an AQ based Oracle application (Oracle 11.2.0.3.0) threw a deadlock after more than 1,000,000 times of unobstructed running since a few years.
The incident is frustrating since the reliability of AQ based implementation is disputed, and resorting to Oracle appears being a long way to go.
This demonstrates again that correctness proof of computer programs is hardly achievable, let alone concurrent programs.
The deadlock is caused by the following simplified pseudo code:
dbms_aqadm.start_queue;
enqueue some message;
dbms_job.submit to launch a dequeue job in background; -- can also be dbms_scheduler
wait till queue is empty;
dbms_aqadm.stop_queue;
which is started in main session and concurrently forks background job session.
The trace dump shows that main session is waiting for 'library cache pin' in the step of dbms_aqadm.stop_queue; whereas job session is waiting for 'enq: TM - contention' with Current SQL Statement:
select ... from queue_table ... for update skip locked;
Although reproducing this deadlock is not evident, we can still try to make some plausible experiment and some sort of post-mortem analysis.
At first we open 3 Sqlplus sessions:
observer session: observer_sid
main session: main_sid
job session: job_sid
before and after each step, we run the following 3 statements to check the delta of statistics:
select sid, type, id1, id2, lmode, request,ctime, block
from v$lock where type in ('TM', 'TQ');
select eq_type, eq_name, req_reason, total_req#, total_wait#,succ_req#,failed_req#, cum_wait_time
from v$enqueue_statistics where eq_type in ('TQ', 'TM');
select locked_total, pinned_total, locks, pins, lock_mode, pin_mode, executions
from v$db_object_cache v where name = 'KSUN_QUEUE';
Inspired by Perturbation Theory, we "perturbe" time series by a table lock so that we can microscope the Oracle internal locking mechanisms.
Here is the time sequence and actions of first experiment on the observer session and main session as well as the noticed statistics change:
T1 observer_sid --- start_queue
T2 observer_sid --- lock queue table in exclusive mode
T3 main_sid --- stop_queue
T4 observer_sid --- check stats
main_sid got 1 TQ Lock in Mode 6, requests 1 TM in Mode 6
TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
TQ Queue table enqueue DDL contention: TOTAL_REQ# increased 1, SUCC_REQ# increased 1
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
T5 observer_sid --- rollback to unlock table
TM DML: TOTAL_REQ# increased 6, SUCC_REQ# increased 7
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
Here is the second experiment on the observer session and job session:
T6 observer_sid --- start_queue
T7 observer_sid --- lock queue table in exclusive mode
T8 job_sid --- dequeue with dequeue_options.wait being 3 seconds
T9 observer_sid --- check stats
job_sid requests 1 TM in Mode 3
TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
T10 observer_sid --- rollback to unlock table
TM DML: TOTAL_REQ# increased 2, but SUCC_REQ# increased 3
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 2
According the above statistics change, we can conclude:
(1). "stop_queue" got 'TQ Queue table enqueue DDL contention' before getting TM DML,
and again 'library cache pin' / 'library cache lock' after getting TM DM.
(2). "dequeue" got 'library cache pin'/'library cache lock' before getting TM DML,
and again 'library cache pin'/'library cache lock' after getting TM DM.
that probably explains:
(a). when main session "stop_queue", it first holds 'TQ Queue table enqueue DDL contention' (and hence TM) lock
and then is waiting for 'library cache pin';
(b). whereas job session when "dequeue", it first holds 'library cache pin'/'library cache lock'
and then is waiting for 'enq: TM - contention' with Current SQL Statement:
select ... from queue_table ... for update skip locked;
Because 'library cache pin' and 'enq: TM - contention' in main session and job session are acquired and requested in reverse order, it signals a deadlock.
Oracle has 4 different req_reasons for TQ:
TQ DDL-INI contention
TQ INI contention
TQ DDL contention
TQ TM contention
what the first experiment shows is that "stop_queue" requires "DDL contention", which seems incurring 1 TQ in Mode 6 firstly and subsequently 1 TM in Mode 6 on queue table.
Probably one fix is to ensure no more Job sessions running before stepping into stop_queue.
Be careful that dbms_job.broken, dbms_job.remove, dbms_scheduler.disable, dbms_scheduler.stop_job, dbms_scheduler.drop_job seem not terminating job session immediately.
The dump file contains some more details about main session:
time in wait: 1.309131 sec
timeout after: 14 min 58 sec
and job session:
time in wait: 1.309007 sec
timeout after: never
Usually Oracle picks the longer waiting session to resolve the deadlock. In this case, it is the main session which raises an error, and job session continues its processing after receiving the lock.
The above text confirms again the "library cache pin" 15 minutes timeout mentioned in Blog:
Oracle 11gR2 single session "library cache pin"
15 minutes timeout can also be justified by following code.
In the first session, execute:
create or replace procedure proc as
begin
execute immediate 'alter procedure proc compile';
end;
/
exec proc;
/
In the second session, run this query a few times:
select in_wait_secs, time_remaining_secs, (in_wait_secs + time_remaining_secs) total_secs
from v$wait_chains;
two evidences can be observed:
(I). total_secs displays constantly 899 seconds (exit when elapsed > 899, that is, one second less than 15 minutes).
(II). both in_wait_secs and time_remaining_secs are updated each 3 seconds (deadlock detection interval).
after 899 seconds, the first session throws:
ORA-04021: timeout occurred while waiting to lock object
In fact, 15 minutes comes from default setting of "_kgl_time_to_wait_for_locks", which controls time to wait for locks and pins before timing out (time in minute, modification requires instance restart).
By the way, following statement requests a TM Lock in Mode 3 even the table is empty.
select ... from queue_table ... for update skip locked;