A few days ago, an AQ based Oracle application (Oracle 11.2.0.3.0) threw a deadlock after more than 1,000,000 times of unobstructed running since a few years.
The incident is frustrating since the reliability of AQ based implementation is disputed, and resorting to Oracle appears being a long way to go.
This demonstrates again that correctness proof of computer programs is hardly achievable, let alone concurrent programs.
The deadlock is caused by the following simplified pseudo code:
dbms_aqadm.start_queue;
enqueue some message;
dbms_job.submit to launch a dequeue job in background; -- can also be dbms_scheduler
wait till queue is empty;
dbms_aqadm.stop_queue;
which is started in main session and concurrently forks background job session.
The trace dump shows that main session is waiting for 'library cache pin' in the step of dbms_aqadm.stop_queue; whereas job session is waiting for 'enq: TM - contention' with Current SQL Statement:
select ... from queue_table ... for update skip locked;
Although reproducing this deadlock is not evident, we can still try to make some plausible experiment and some sort of post-mortem analysis.
At first we open 3 Sqlplus sessions:
observer session: observer_sid
main session: main_sid
job session: job_sid
before and after each step, we run the following 3 statements to check the delta of statistics:
select sid, type, id1, id2, lmode, request,ctime, block
from v$lock where type in ('TM', 'TQ');
select eq_type, eq_name, req_reason, total_req#, total_wait#,succ_req#,failed_req#, cum_wait_time
from v$enqueue_statistics where eq_type in ('TQ', 'TM');
select locked_total, pinned_total, locks, pins, lock_mode, pin_mode, executions
from v$db_object_cache v where name = 'KSUN_QUEUE';
Inspired by Perturbation Theory, we "perturbe" time series by a table lock so that we can microscope the Oracle internal locking mechanisms.
Here is the time sequence and actions of first experiment on the observer session and main session as well as the noticed statistics change:
T1 observer_sid --- start_queue
T2 observer_sid --- lock queue table in exclusive mode
T3 main_sid --- stop_queue
T4 observer_sid --- check stats
main_sid got 1 TQ Lock in Mode 6, requests 1 TM in Mode 6
TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
TQ Queue table enqueue DDL contention: TOTAL_REQ# increased 1, SUCC_REQ# increased 1
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
T5 observer_sid --- rollback to unlock table
TM DML: TOTAL_REQ# increased 6, SUCC_REQ# increased 7
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
Here is the second experiment on the observer session and job session:
T6 observer_sid --- start_queue
T7 observer_sid --- lock queue table in exclusive mode
T8 job_sid --- dequeue with dequeue_options.wait being 3 seconds
T9 observer_sid --- check stats
job_sid requests 1 TM in Mode 3
TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
T10 observer_sid --- rollback to unlock table
TM DML: TOTAL_REQ# increased 2, but SUCC_REQ# increased 3
LOCKED_TOTAL increased 1, PINNED_TOTAL increased 2
According the above statistics change, we can conclude:
(1). "stop_queue" got 'TQ Queue table enqueue DDL contention' before getting TM DML,
and again 'library cache pin' / 'library cache lock' after getting TM DM.
(2). "dequeue" got 'library cache pin'/'library cache lock' before getting TM DML,
and again 'library cache pin'/'library cache lock' after getting TM DM.
that probably explains:
(a). when main session "stop_queue", it first holds 'TQ Queue table enqueue DDL contention' (and hence TM) lock
and then is waiting for 'library cache pin';
(b). whereas job session when "dequeue", it first holds 'library cache pin'/'library cache lock'
and then is waiting for 'enq: TM - contention' with Current SQL Statement:
select ... from queue_table ... for update skip locked;
Because 'library cache pin' and 'enq: TM - contention' in main session and job session are acquired and requested in reverse order, it signals a deadlock.
Oracle has 4 different req_reasons for TQ:
TQ DDL-INI contention
TQ INI contention
TQ DDL contention
TQ TM contention
what the first experiment shows is that "stop_queue" requires "DDL contention", which seems incurring 1 TQ in Mode 6 firstly and subsequently 1 TM in Mode 6 on queue table.
Probably one fix is to ensure no more Job sessions running before stepping into stop_queue.
Be careful that dbms_job.broken, dbms_job.remove, dbms_scheduler.disable, dbms_scheduler.stop_job, dbms_scheduler.drop_job seem not terminating job session immediately.
The dump file contains some more details about main session:
time in wait: 1.309131 sec
timeout after: 14 min 58 sec
and job session:
time in wait: 1.309007 sec
timeout after: never
Usually Oracle picks the longer waiting session to resolve the deadlock. In this case, it is the main session which raises an error, and job session continues its processing after receiving the lock.
The above text confirms again the "library cache pin" 15 minutes timeout mentioned in Blog:
Oracle 11gR2 single session "library cache pin"
15 minutes timeout can also be justified by following code.
In the first session, execute:
create or replace procedure proc as
begin
execute immediate 'alter procedure proc compile';
end;
/
exec proc;
/
In the second session, run this query a few times:
select in_wait_secs, time_remaining_secs, (in_wait_secs + time_remaining_secs) total_secs
from v$wait_chains;
two evidences can be observed:
(I). total_secs displays constantly 899 seconds (exit when elapsed > 899, that is, one second less than 15 minutes).
(II). both in_wait_secs and time_remaining_secs are updated each 3 seconds (deadlock detection interval).
after 899 seconds, the first session throws:
ORA-04021: timeout occurred while waiting to lock object
In fact, 15 minutes comes from default setting of "_kgl_time_to_wait_for_locks", which controls time to wait for locks and pins before timing out (time in minute, modification requires instance restart).
By the way, following statement requests a TM Lock in Mode 3 even the table is empty.
select ... from queue_table ... for update skip locked;
Wednesday, November 21, 2012
Friday, November 16, 2012
Oracle Java OutOfMemoryError
A long time perplexed issue is that a Java stored procedure in Oracle throws a java.lang.OutOfMemoryError very seldom, once a couple of months since a few years.
Analog to Java non-standard option: -Xmx, the appended test code 1 attempts to use:
OracleRuntime.setMaxMemorySize
to provoke OutOfMemoryError.
On a Sqlplus window, when one runs:
set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(500000);
we get the error:
Exception in thread "Root Thread" java.lang.Throwable:
--- Message ---
null
--- Cause ---
null
--- StackTrace ---
java.lang.OutOfMemoryError
at JavaOutput.output(JavaOutput:20)
at JavaOutput.output(JavaOutput:28)
Open a new Sqlplus window (hence a new Oracle session), when one runs:
set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(900000);
call succeeds. If one call again: exec testp1(500000);
no more error.
Test Code 2 is the original Java stored procedure. It tries to make some post-processing after OutOfMemoryError, which causes further OutOfMemoryError.
There are two deficiencies with this code.
(1). It converts OutOfMemoryError (a subclass of Error) into RuntimeException (a subclass of Exception). And Java Doc on Class Error said:
An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch.
(2). There is no OutOfMemoryError printed in the output:
Exception in thread "Root Thread" JavaOutput$WrapException:
--- Message ---
null
at JavaOutput$ExceptHelper.setError(JavaOutput:41)
at JavaOutput.output(JavaOutput:25)
and the root cause is discarded.The reason is that "message" and "cause" are optional in Java VM implementation and it seems
that Oracle JVM does not include them. Here is an excerpt of Java Doc about class Throwable:
It can also contain a message string that gives more information about the error.
Finally, it can contain a cause: another throwable that caused this throwable to get thrown.
Test Code 1
(tested in Oracle 11.2.0.3.0 on AIX and Solaris):
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.StringWriter;
import java.io.PrintWriter;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
public static void output(String file, BLOB blob, long rtMaxMSSet) throws Throwable {
InputStream inStream = null;
OutputStream outStream = null;
try {
OracleRuntime.setMaxMemorySize(rtMaxMSSet);
int bufferSize = blob.getBufferSize();
inStream = blob.getBinaryStream();
outStream = new FileOutputStream(file, true);
byte[] buffer = new byte[bufferSize];
int length = -1;
while ((length = inStream.read(buffer)) != -1) {
outStream.write(buffer, 0, length);
}
} catch (Throwable t) {
String msg = t.getMessage();
String stackTrc = getStackTrace(t);
throw new Throwable("\n --- Message ---\n" + t.getMessage() +
"\n --- Cause ---\n" + t.getCause() +
"\n --- StackTrace ---\n" + stackTrc);
} finally {
if (inStream != null) {
inStream.close();
}
if (outStream != null) {
outStream.close();
}
}
}
static String getStackTrace(Throwable t)
{
Writer writer = new StringWriter();
PrintWriter printWriter = new PrintWriter(writer);
t.printStackTrace(printWriter);
return writer.toString();
}
}
/
create or replace procedure java_output(
p_file_name varchar2
,p_blob blob
,p_max_size number
) as language java name 'JavaOutput.output(java.lang.String, oracle.sql.BLOB, long)';
/
create or replace procedure testp1(p_max_size number) as
l_blob_huge blob := empty_blob;
l_file_name varchar2(200) ;
l_raw raw(100);
begin
select diagnostic_dest||'/diag/rdbms/' || db_name||'/'|| db_name||'/trace/ksun_output.txt'
into l_file_name
from ( select
(select value from v$parameter where name = 'diagnostic_dest') diagnostic_dest
,(select value from v$parameter where name = 'db_name') db_name
from dual);
dbms_lob.createtemporary(l_blob_huge, true);
l_raw := '4b53554e'; --ksun;
dbms_lob.writeappend(l_blob_huge, lengthb(l_raw)/2, l_raw);
java_output(l_file_name, l_blob_huge, p_max_size);
end testp1;
/
Test Code 2:
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.sql.SQLException;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
public static void output(String file, BLOB blob, long rtMaxMSSet) throws IOException, SQLException {
InputStream inStream = null;
OutputStream outStream = null;
try {
OracleRuntime.setMaxMemorySize(rtMaxMSSet);
int bufferSize = blob.getBufferSize();
inStream = blob.getBinaryStream();
outStream = new FileOutputStream(file, true);
byte[] buffer = new byte[bufferSize];
int length = -1;
while ((length = inStream.read(buffer)) != -1) {
outStream.write(buffer, 0, length);
}
} catch (Throwable t) {
ExceptHelper.setError(t);
} finally {
if (inStream != null) {
inStream.close();
}
if (outStream != null) {
outStream.close();
}
}
}
static class ExceptHelper {
static void setError(Throwable t) {
try {
byte[] errReport = new byte[400000];
} finally {
throw new WrapException("\n --- Message ---\n" + t.getMessage());
}
}
}
static class WrapException extends RuntimeException {
WrapException(String msg) {
super(msg);
}
}
}
/