Wednesday, November 21, 2012

One Oracle AQ Deadlock

A few days ago, an AQ based Oracle application (Oracle 11.2.0.3.0) threw a deadlock after more than 1,000,000 times of unobstructed running since a few years.

The incident is frustrating since the reliability of AQ based implementation is disputed, and resorting to Oracle appears being a long way to go.

This demonstrates again that correctness proof of computer programs is hardly achievable, let alone concurrent programs.

The deadlock is caused by the following simplified pseudo code:

 dbms_aqadm.start_queue;
 enqueue some message;
 dbms_job.submit to launch a dequeue job in background;  -- can also be dbms_scheduler
 wait till queue is empty;
 dbms_aqadm.stop_queue;


which is started in main session and concurrently forks background job session.

The trace dump shows that main session is waiting for 'library cache pin' in the step of dbms_aqadm.stop_queue; whereas job session is waiting for 'enq: TM - contention' with Current SQL Statement:

 select ... from queue_table ... for update skip locked;

Although reproducing this deadlock is not evident, we can still try to make some plausible experiment and some sort of post-mortem analysis.

At first we open 3 Sqlplus sessions:

  observer session: observer_sid
 main session:     main_sid
 job session:      job_sid


before and after each step, we run the following 3 statements to check the delta of statistics:

 select sid, type, id1, id2, lmode, request,ctime, block
   from v$lock where type in ('TM', 'TQ');
 select eq_type, eq_name, req_reason, total_req#, total_wait#,succ_req#,failed_req#, cum_wait_time
   from v$enqueue_statistics where eq_type in ('TQ', 'TM');
 select locked_total, pinned_total, locks, pins, lock_mode, pin_mode, executions
   from v$db_object_cache v where name = 'KSUN_QUEUE';


Inspired by Perturbation Theory, we "perturbe" time series by a table lock so that we can microscope the Oracle internal locking mechanisms.

Here is the time sequence and actions of first experiment on the observer session and main session as well as the noticed statistics change:

 T1 observer_sid --- start_queue
 T2 observer_sid --- lock queue table in exclusive mode
 T3 main_sid     --- stop_queue
 T4 observer_sid --- check stats
    main_sid got 1 TQ Lock in Mode 6, requests 1 TM in Mode 6
    TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
    TQ Queue table enqueue DDL contention: TOTAL_REQ# increased 1, SUCC_REQ# increased 1
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
 T5 observer_sid --- rollback to unlock table
    TM DML: TOTAL_REQ# increased 6, SUCC_REQ# increased 7
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1


Here is the second experiment on the observer session and job session:

 T6 observer_sid --- start_queue
 T7 observer_sid --- lock queue table in exclusive mode
 T8 job_sid --- dequeue with dequeue_options.wait being 3 seconds
 T9 observer_sid --- check stats
    job_sid requests 1 TM in Mode 3
    TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
 T10 observer_sid --- rollback to unlock table
    TM DML: TOTAL_REQ# increased 2, but SUCC_REQ# increased 3
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 2


According the above statistics change, we can conclude:

 (1). "stop_queue" got 'TQ Queue table enqueue DDL contention' before getting TM DML,
         and again 'library cache pin' / 'library cache lock' after getting TM DM.

 (2). "dequeue" got 'library cache pin'/'library cache lock' before getting TM DML,
         and again 'library cache pin'/'library cache lock' after getting TM DM.

that probably explains:

(a). when main session "stop_queue", it first holds 'TQ Queue table enqueue DDL contention' (and hence TM) lock
      and then is waiting for 'library cache pin';

(b). whereas job session when "dequeue", it first holds 'library cache pin'/'library cache lock'
      and then is waiting for 'enq: TM - contention' with Current SQL Statement:

         select ... from queue_table ... for update skip locked;

Because 'library cache pin' and 'enq: TM - contention' in main session and job session are acquired and requested in reverse order, it signals a deadlock.

Oracle has 4 different req_reasons for TQ:

 TQ DDL-INI contention
 TQ INI contention
 TQ DDL contention
 TQ TM contention


what the first experiment shows is that "stop_queue" requires "DDL contention", which seems incurring 1 TQ in Mode 6 firstly and subsequently 1 TM in Mode 6 on queue table.

Probably one fix is to ensure no more Job sessions running before stepping into stop_queue.

Be careful that dbms_job.broken, dbms_job.remove, dbms_scheduler.disable, dbms_scheduler.stop_job, dbms_scheduler.drop_job seem not terminating job session immediately.

The dump file contains some more details about main session:

           time in wait: 1.309131 sec
    timeout after: 14 min 58 sec

          
and job session:
       
            time in wait: 1.309007 sec
     timeout after: never
  
          
Usually Oracle picks the longer waiting session to resolve the deadlock. In this case, it is the main session which raises an error, and job session continues its processing after receiving the lock.
   
The above text confirms again the "library cache pin" 15 minutes timeout mentioned in Blog:
Oracle 11gR2 single session "library cache pin"

15 minutes timeout can also be justified by following code.

In the first session, execute:

 create or replace procedure proc as
 begin
   execute immediate 'alter procedure proc compile';
 end;
 /

 exec proc;
 /


In the second session, run this query a few times:

 select in_wait_secs, time_remaining_secs, (in_wait_secs + time_remaining_secs) total_secs
   from v$wait_chains;

two evidences can be observed:

(I).  total_secs displays constantly 899 seconds (exit when elapsed > 899, that is, one second less than 15 minutes).

(II). both in_wait_secs and time_remaining_secs are updated each 3 seconds (deadlock detection interval).

after 899 seconds, the first session throws:

 ORA-04021: timeout occurred while waiting to lock object

In fact, 15 minutes comes from default setting of "_kgl_time_to_wait_for_locks", which controls time to wait for locks and pins before timing out (time in minute, modification requires instance restart).

By the way, following statement requests a TM Lock in Mode 3 even the table is empty.

 select ... from queue_table ... for update skip locked; 

Friday, November 16, 2012

Oracle Java OutOfMemoryError


A long time perplexed issue is that a Java stored procedure in Oracle throws a java.lang.OutOfMemoryError very seldom, once a couple of months since a few years.

Analog to Java non-standard option: -Xmx, the appended test code 1 attempts to use:
   OracleRuntime.setMaxMemorySize
to provoke OutOfMemoryError.

On a Sqlplus window, when one runs:

set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(500000); 

we get the error:

Exception in thread "Root Thread" java.lang.Throwable:
--- Message ---
null
--- Cause ---
null
--- StackTrace ---
java.lang.OutOfMemoryError
        at JavaOutput.output(JavaOutput:20)
        at JavaOutput.output(JavaOutput:28)

Open a new Sqlplus window (hence a new Oracle session), when one runs:
        
set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(900000); 

call succeeds. If one call again:
  exec testp1(500000);
no more error.

Test Code 2 is the original Java stored procedure. It tries to make some post-processing after OutOfMemoryError, which causes further OutOfMemoryError.

There are two deficiencies with this code.

(1). It converts OutOfMemoryError (a subclass of Error) into RuntimeException (a subclass of Exception). And Java Doc on Class Error said:

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch.

(2). There is no OutOfMemoryError printed in the output:
         
Exception in thread "Root Thread" JavaOutput$WrapException:
--- Message ---
null
        at JavaOutput$ExceptHelper.setError(JavaOutput:41)
        at JavaOutput.output(JavaOutput:25)

and the root cause is discarded.

The reason is that "message" and "cause" are optional in Java VM implementation and it seems
that Oracle JVM does not include them. Here is an excerpt of Java Doc about class Throwable:

 It can also contain a message string that gives more information about the error.
 Finally, it can contain a cause: another throwable that caused this throwable to get thrown.


Test Code 1

(tested in Oracle 11.2.0.3.0 on AIX and Solaris):

CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as 
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.StringWriter;
import java.io.PrintWriter;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
  public static void output(String file, BLOB blob, long rtMaxMSSet) throws Throwable {
   InputStream  inStream = null;
   OutputStream outStream = null;
    try {
          OracleRuntime.setMaxMemorySize(rtMaxMSSet);
          
          int bufferSize = blob.getBufferSize();
          inStream  = blob.getBinaryStream();
          outStream = new FileOutputStream(file, true);
          
          byte[] buffer = new byte[bufferSize];
          int length = -1;
          while ((length = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, length);
          }
    } catch (Throwable t) {
          String msg = t.getMessage();
          String stackTrc = getStackTrace(t);
          throw new Throwable("\n --- Message ---\n"    + t.getMessage() + 
                              "\n --- Cause ---\n"      + t.getCause() + 
                              "\n --- StackTrace ---\n" + stackTrc);
    } finally {
           if (inStream != null) {
                 inStream.close();
           }
           if (outStream != null) {
                 outStream.close();
           }
    }
  }
  static String getStackTrace(Throwable t)
  {
    Writer writer = new StringWriter();
    PrintWriter printWriter = new PrintWriter(writer);
    t.printStackTrace(printWriter);
    return writer.toString();
  }
}
/
create or replace procedure java_output(
  p_file_name               varchar2
 ,p_blob                    blob
 ,p_max_size                number
  ) as language java name 'JavaOutput.output(java.lang.String, oracle.sql.BLOB, long)';
/
create or replace procedure testp1(p_max_size number) as
  l_blob_huge               blob  := empty_blob;
  l_file_name               varchar2(200) ;
  l_raw                     raw(100);
  begin
    select  diagnostic_dest||'/diag/rdbms/' || db_name||'/'|| db_name||'/trace/ksun_output.txt'
    into    l_file_name
    from   ( select
             (select value from v$parameter where name = 'diagnostic_dest') diagnostic_dest
            ,(select value from v$parameter where name = 'db_name') db_name
            from dual);
    
    dbms_lob.createtemporary(l_blob_huge, true);
    l_raw :=  '4b53554e'; --ksun;
    dbms_lob.writeappend(l_blob_huge, lengthb(l_raw)/2, l_raw);
    java_output(l_file_name, l_blob_huge, p_max_size);
end testp1;
/

Test Code 2:


CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as 
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.sql.SQLException;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
  public static void output(String file, BLOB blob, long rtMaxMSSet) throws IOException, SQLException {
   InputStream  inStream = null;
   OutputStream outStream = null;  
    try {
          OracleRuntime.setMaxMemorySize(rtMaxMSSet);
          
          int bufferSize = blob.getBufferSize();
          inStream  = blob.getBinaryStream();
          outStream = new FileOutputStream(file, true);
          
          byte[] buffer = new byte[bufferSize];
          int length = -1;
          while ((length = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, length);
          }
    } catch (Throwable t) {
          ExceptHelper.setError(t);
    } finally {
           if (inStream != null) {
                 inStream.close();
           }
           if (outStream != null) {
                 outStream.close();
           }
    }
  }
  
  static class ExceptHelper {
    static void setError(Throwable t) {
      try {
          byte[] errReport = new byte[400000];  
      } finally {
          throw new WrapException("\n --- Message ---\n" + t.getMessage());
      } 
    }
  }
  
  static class WrapException extends RuntimeException {
    WrapException(String msg) {
      super(msg);
    }
  }
}
/