Friday, December 7, 2012

Delete Restart and new Active Undo Extent

Oracle dbms_lock.allocate_unique allocates a unique lock ID to a named lock and returns the handle associated with the lock ID, so that applications can use this handle in subsequent calls to request, convert, and release the lock.

allocate_unique occasionally runs:
   delete dbms_lock_allocated where expiration < sysdate;
to clean up expired UL locks.
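For reference, a typical call sequence around allocate_unique looks like the following sketch (the lock name KSUN_TEST_LOCK and the 5-second timeout are arbitrary examples, not from any real application):

```sql
declare
  l_handle  varchar2(128);
  l_status  integer;
begin
  -- allocate_unique commits internally and may run the expiration delete shown above
  dbms_lock.allocate_unique(lockname => 'KSUN_TEST_LOCK', lockhandle => l_handle);
  l_status := dbms_lock.request(lockhandle => l_handle,
                                lockmode   => dbms_lock.x_mode,
                                timeout    => 5);
  if l_status = 0 then            -- 0 means the lock was granted
    l_status := dbms_lock.release(l_handle);
  end if;
end;
/
```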

This delete statement makes me rethink my previous Blog: "Update Restart and new Active Undo Extent"

Can this delete also cause a restart?

The appended test code confirms it (tested on AIX and Solaris in Release 11.2.0.3.0).

When several sessions call allocate_unique concurrently, it can cause noticeable "enq: TX - row lock contention" on dbms_lock_allocated.

As an unintended side effect, once more than 100,000 "UL" locks have been generated by dbms_lock.request, queries on v$lock and v$enqueue_lock are heavily slowed down, even after all "UL" locks are released.
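To get a feeling for the accumulation, a quick check can be sketched as follows (to be run as a privileged user; dbms_lock_allocated is owned by SYS):

```sql
select count(*) from v$lock where type = 'UL';                            -- active UL locks
select count(*) from sys.dbms_lock_allocated where expiration < sysdate;  -- expired rows awaiting cleanup
```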

Test Code


create or replace function get_undo_extent_cnt return number
as
  l_cnt number;
begin  
 select count(*) into l_cnt from dba_undo_extents where status = 'ACTIVE';
 return l_cnt;
end;
/


create or replace package pacepack
as
  type array is table of number index by varchar2(40);
  g_data                  array;
  g_cnt                   number;
  g_restart_cnt           number;
  g_undo_extent_cnt       number;
end pacepack;
/


drop table delete_data;

create table delete_data as
  select sysdate + level/86400 tim, rpad('abc', 100, 'x') txt from dual connect by level <= 100000;
   
create or replace trigger delete_data_bt
  before delete or update on delete_data
  for each row
begin
  if ( pacepack.g_data.exists(rowidtochar(:old.rowid)))
  then
    pacepack.g_restart_cnt := pacepack.g_restart_cnt + 1;
  
    dbms_output.put_line( 'doing "' || :old.rowid ||
                           '" again was called ' || pacepack.g_cnt );
    dbms_output.put_line('-- Restart#: ' || pacepack.g_restart_cnt ||
                            ', Undo Extent Count: ' || get_undo_extent_cnt);
  else
    pacepack.g_data(rowidtochar(:old.rowid)) := 1;
  end if;
     
  pacepack.g_cnt        := pacepack.g_cnt + 1;
end;
/


create or replace procedure testp as
  l_cnt     number := 0;
  l_del_cnt number;
begin
  pacepack.g_data.delete;
  pacepack.g_cnt         := 0;
  pacepack.g_restart_cnt := 0;
       
  delete delete_data where tim <= (select max(tim) from delete_data);
  
  l_del_cnt := sql%rowcount;
  dbms_output.put_line( 'trigger_count (' || pacepack.g_cnt         || ') = ' ||
                        'delete_count  (' || l_del_cnt              || ') + ' ||
                        'restart_count (' || pacepack.g_restart_cnt || ') = ' ||
                         (l_del_cnt +  pacepack.g_restart_cnt)
                      );
       
  commit;
 
 end;
/


exec testp;

Output


doing "AAC2jpAAAAAHmm2AAF" again was called 10716
-- Restart#: 1, Undo Extent Count: 5
doing "AAC2jpAAAAAHmoFAAS" again was called 15581
-- Restart#: 2, Undo Extent Count: 6
doing "AAC2jpAAAAAHmpSAAf" again was called 20446
-- Restart#: 3, Undo Extent Count: 7
doing "AAC2jpAAAAAHmqhAAs" again was called 25311
-- Restart#: 4, Undo Extent Count: 8
doing "AAC2jpAAAAAHmruAA5" again was called 30176
-- Restart#: 5, Undo Extent Count: 9
doing "AAC2jpAAAAAHms+AAH" again was called 35041
-- Restart#: 6, Undo Extent Count: 10
doing "AAC2jpAAAAAHmuNAAU" again was called 39906
-- Restart#: 7, Undo Extent Count: 11
doing "AAC2jpAAAAAHmvaAAh" again was called 44771
-- Restart#: 8, Undo Extent Count: 12
doing "AAC2jpAAAAAHmwpAAu" again was called 49636
-- Restart#: 9, Undo Extent Count: 13
doing "AAC2jpAAAAAHmx2AA7" again was called 54501
-- Restart#: 10, Undo Extent Count: 14
doing "AAC2jpAAAAAHmzGAAJ" again was called 59366
-- Restart#: 11, Undo Extent Count: 15
doing "AAC2jpAAAAAHm0VAAW" again was called 64231
-- Restart#: 12, Undo Extent Count: 16
doing "AAC2jpAAAAAHm1iAAj" again was called 69096
-- Restart#: 13, Undo Extent Count: 17
doing "AAC2jpAAAAAHm2xAAw" again was called 73961
-- Restart#: 14, Undo Extent Count: 18
doing "AAC2jpAAAAAHm3+AA9" again was called 78826
-- Restart#: 15, Undo Extent Count: 19
doing "AAC2jpAAAAAHm5OAAL" again was called 83691
-- Restart#: 16, Undo Extent Count: 20
doing "AAC2jpAAAAAHm6dAAY" again was called 88556
-- Restart#: 17, Undo Extent Count: 21
doing "AAC2jpAAAAAHm7qAAl" again was called 93421
-- Restart#: 18, Undo Extent Count: 22
doing "AAC2jpAAAAAHm85AAy" again was called 98286
-- Restart#: 19, Undo Extent Count: 23
trigger_count (100019) = delete_count  (100000) + restart_count (19) = 100019


So for each restart, a new active Undo Extent is created.

Wednesday, November 21, 2012

One Oracle AQ Deadlock

A few days ago, an AQ based Oracle application (Oracle 11.2.0.3.0) threw a deadlock, after having run unobstructed more than 1,000,000 times over several years.

The incident is frustrating: the reliability of the AQ based implementation is now disputed, and getting a resolution from Oracle looks like a long way to go.

This demonstrates again that proving the correctness of computer programs is hardly achievable, let alone of concurrent programs.

The deadlock is caused by the following simplified pseudo code:

 dbms_aqadm.start_queue;
 enqueue some message;
 dbms_job.submit to launch a dequeue job in background;  -- can also be dbms_scheduler
 wait till queue is empty;
 dbms_aqadm.stop_queue;


which runs in the main session and concurrently forks a background job session.

The trace dump shows that the main session is waiting for 'library cache pin' in the dbms_aqadm.stop_queue step, whereas the job session is waiting for 'enq: TM - contention' with the current SQL statement:

 select ... from queue_table ... for update skip locked;

Although reproducing this deadlock is not straightforward, we can still attempt a plausible experiment and some post-mortem analysis.

At first we open 3 Sqlplus sessions:

 observer session: observer_sid
 main session:     main_sid
 job session:      job_sid


Before and after each step, we run the following 3 statements to check the delta of the statistics:

 select sid, type, id1, id2, lmode, request,ctime, block
   from v$lock where type in ('TM', 'TQ');
 select eq_type, eq_name, req_reason, total_req#, total_wait#,succ_req#,failed_req#, cum_wait_time
   from v$enqueue_statistics where eq_type in ('TQ', 'TM');
 select locked_total, pinned_total, locks, pins, lock_mode, pin_mode, executions
   from v$db_object_cache v where name = 'KSUN_QUEUE';


Inspired by perturbation theory, we "perturb" the time sequence with a table lock so that we can microscope Oracle's internal locking mechanisms.

Here is the time sequence and actions of the first experiment on the observer session and the main session, as well as the observed statistics changes:

 T1 observer_sid --- start_queue
 T2 observer_sid --- lock queue table in exclusive mode
 T3 main_sid     --- stop_queue
 T4 observer_sid --- check stats
    main_sid got 1 TQ Lock in Mode 6, requests 1 TM in Mode 6
    TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
    TQ Queue table enqueue DDL contention: TOTAL_REQ# increased 1, SUCC_REQ# increased 1
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
 T5 observer_sid --- rollback to unlock table
    TM DML: TOTAL_REQ# increased 6, SUCC_REQ# increased 7
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1


Here is the second experiment on the observer session and job session:

 T6 observer_sid --- start_queue
 T7 observer_sid --- lock queue table in exclusive mode
 T8 job_sid --- dequeue with dequeue_options.wait being 3 seconds
 T9 observer_sid --- check stats
    job_sid requests 1 TM in Mode 3
    TM DML: TOTAL_REQ# increased 1, but SUCC_REQ# not changed
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 1
 T10 observer_sid --- rollback to unlock table
    TM DML: TOTAL_REQ# increased 2, but SUCC_REQ# increased 3
    LOCKED_TOTAL increased 1, PINNED_TOTAL increased 2


According to the above statistics changes, we can conclude:

 (1). "stop_queue" got 'TQ Queue table enqueue DDL contention' before getting TM DML,
         and again 'library cache pin' / 'library cache lock' after getting TM DML.

 (2). "dequeue" got 'library cache pin'/'library cache lock' before getting TM DML,
         and again 'library cache pin'/'library cache lock' after getting TM DML.

that probably explains:

(a). when main session "stop_queue", it first holds 'TQ Queue table enqueue DDL contention' (and hence TM) lock
      and then is waiting for 'library cache pin';

(b). whereas job session when "dequeue", it first holds 'library cache pin'/'library cache lock'
      and then is waiting for 'enq: TM - contention' with Current SQL Statement:

         select ... from queue_table ... for update skip locked;

Because 'library cache pin' and 'enq: TM - contention' are acquired and requested in reverse order by the main session and the job session, this signals a deadlock.

Oracle has 4 different req_reasons for TQ:

 TQ DDL-INI contention
 TQ INI contention
 TQ DDL contention
 TQ TM contention


The first experiment shows that "stop_queue" requires "TQ DDL contention", which seems to acquire first 1 TQ in Mode 6 and subsequently 1 TM in Mode 6 on the queue table.

Probably one fix is to ensure that no more job sessions are running before stepping into stop_queue.

Be careful: dbms_job.broken, dbms_job.remove, dbms_scheduler.disable, dbms_scheduler.stop_job and dbms_scheduler.drop_job seem not to terminate the job session immediately.
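A minimal sketch of such a guard before stop_queue could look like this (the queue name KSUN_QUEUE and the one-second polling interval are assumptions for illustration):

```sql
declare
  l_running  number;
begin
  loop
    -- wait until no background job session is active any more
    select count(*) into l_running from dba_jobs_running;
    exit when l_running = 0;
    dbms_lock.sleep(1);
  end loop;
  dbms_aqadm.stop_queue(queue_name => 'KSUN_QUEUE');
end;
/
```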

The dump file contains some more details about the main session:

    time in wait: 1.309131 sec
    timeout after: 14 min 58 sec

and the job session:

    time in wait: 1.309007 sec
    timeout after: never
          
Usually Oracle picks the longer-waiting session to resolve the deadlock. In this case, the main session raises an error, and the job session continues its processing after receiving the lock.
   
The above text confirms again the "library cache pin" 15-minute timeout mentioned in the Blog:
Oracle 11gR2 single session "library cache pin"

The 15-minute timeout can also be demonstrated with the following code.

In the first session, execute:

 create or replace procedure proc as
 begin
   execute immediate 'alter procedure proc compile';
 end;
 /

 exec proc;


In the second session, run this query a few times:

 select in_wait_secs, time_remaining_secs, (in_wait_secs + time_remaining_secs) total_secs
   from v$wait_chains;

Two observations can be made:

(I).  total_secs constantly displays 899 seconds (the wait exits when elapsed > 899, that is, one second less than 15 minutes).

(II). Both in_wait_secs and time_remaining_secs are updated every 3 seconds (the deadlock detection interval).

After 899 seconds, the first session throws:

 ORA-04021: timeout occurred while waiting to lock object

In fact, the 15 minutes come from the default setting of "_kgl_time_to_wait_for_locks", which controls the time to wait for locks and pins before timing out (time in minutes; modification requires an instance restart).
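The current setting can be read (as SYS) from the underlying x$ tables; this is a sketch against unsupported internals, so table and column names may vary between versions:

```sql
select i.ksppinm  name,
       v.ksppstvl value
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_kgl_time_to_wait_for_locks';
-- default value: 15 (minutes)
```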

By the way, the following statement requests a TM lock in Mode 3 even when the table is empty.

 select ... from queue_table ... for update skip locked; 
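This can be verified with a sketch like the following (the table ksun_empty is a hypothetical example): run the select for update in one session, then inspect v$lock from another.

```sql
-- session 1: empty table, select for update returns no rows but still takes the TM lock
create table ksun_empty (id number);
select id from ksun_empty for update skip locked;

-- session 2: look for the TM lock in mode 3 (row exclusive) on the table
select l.sid, l.type, l.lmode
  from v$lock l
 where l.type = 'TM'
   and l.id1  = (select object_id from dba_objects
                  where object_name = 'KSUN_EMPTY');
```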

Friday, November 16, 2012

Oracle Java OutOfMemoryError


A long-perplexing issue is that a Java stored procedure in Oracle throws a java.lang.OutOfMemoryError very seldom: once every couple of months, over the past few years.

Analogous to the Java non-standard option -Xmx, the appended Test Code 1 attempts to use:
   OracleRuntime.setMaxMemorySize
to provoke an OutOfMemoryError.

In a Sqlplus window, when one runs:

set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(500000); 

we get the error:

Exception in thread "Root Thread" java.lang.Throwable:
--- Message ---
null
--- Cause ---
null
--- StackTrace ---
java.lang.OutOfMemoryError
        at JavaOutput.output(JavaOutput:20)
        at JavaOutput.output(JavaOutput:28)

Open a new Sqlplus window (hence a new Oracle session) and run:
        
set serveroutput on size 50000
exec dbms_java.set_output(50000);
exec testp1(900000); 

the call succeeds. If one then calls:
  exec testp1(500000);
again, there is no more error.

Test Code 2 is the original Java stored procedure. It tries to do some post-processing after the OutOfMemoryError, which causes a further OutOfMemoryError.

There are two deficiencies with this code.

(1). It converts OutOfMemoryError (a subclass of Error) into a RuntimeException (a subclass of Exception). The Java Doc for class Error says:

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch.

(2). There is no OutOfMemoryError printed in the output:
         
Exception in thread "Root Thread" JavaOutput$WrapException:
--- Message ---
null
        at JavaOutput$ExceptHelper.setError(JavaOutput:41)
        at JavaOutput.output(JavaOutput:25)

and the root cause is discarded.

The reason is that "message" and "cause" are optional in a Java VM implementation, and it seems that the Oracle JVM does not include them. Here is an excerpt of the Java Doc for class Throwable:

 It can also contain a message string that gives more information about the error.
 Finally, it can contain a cause: another throwable that caused this throwable to get thrown.


Test Code 1

(tested in Oracle 11.2.0.3.0 on AIX and Solaris):

CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as 
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.StringWriter;
import java.io.PrintWriter;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
  public static void output(String file, BLOB blob, long rtMaxMSSet) throws Throwable {
   InputStream  inStream = null;
   OutputStream outStream = null;
    try {
          OracleRuntime.setMaxMemorySize(rtMaxMSSet);
          
          int bufferSize = blob.getBufferSize();
          inStream  = blob.getBinaryStream();
          outStream = new FileOutputStream(file, true);
          
          byte[] buffer = new byte[bufferSize];
          int length = -1;
          while ((length = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, length);
          }
    } catch (Throwable t) {
          String stackTrc = getStackTrace(t);
          throw new Throwable("\n --- Message ---\n"    + t.getMessage() + 
                              "\n --- Cause ---\n"      + t.getCause() + 
                              "\n --- StackTrace ---\n" + stackTrc);
    } finally {
           if (inStream != null) {
                 inStream.close();
           }
           if (outStream != null) {
                 outStream.close();
           }
    }
  }
  static String getStackTrace(Throwable t)
  {
    Writer writer = new StringWriter();
    PrintWriter printWriter = new PrintWriter(writer);
    t.printStackTrace(printWriter);
    return writer.toString();
  }
}
/
create or replace procedure java_output(
  p_file_name               varchar2
 ,p_blob                    blob
 ,p_max_size                number
  ) as language java name 'JavaOutput.output(java.lang.String, oracle.sql.BLOB, long)';
/
create or replace procedure testp1(p_max_size number) as
  l_blob_huge               blob  := empty_blob;
  l_file_name               varchar2(200) ;
  l_raw                     raw(100);
  begin
    select  diagnostic_dest||'/diag/rdbms/' || db_name||'/'|| db_name||'/trace/ksun_output.txt'
    into    l_file_name
    from   ( select
             (select value from v$parameter where name = 'diagnostic_dest') diagnostic_dest
            ,(select value from v$parameter where name = 'db_name') db_name
            from dual);
    
    dbms_lob.createtemporary(l_blob_huge, true);
    l_raw :=  '4b53554e'; --ksun;
    dbms_lob.writeappend(l_blob_huge, lengthb(l_raw)/2, l_raw);
    java_output(l_file_name, l_blob_huge, p_max_size);
end testp1;
/

Test Code 2:


CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED "JavaOutput" as 
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.sql.SQLException;
import oracle.sql.BLOB;
import oracle.aurora.vm.OracleRuntime;
public class JavaOutput {
  public static void output(String file, BLOB blob, long rtMaxMSSet) throws IOException, SQLException {
   InputStream  inStream = null;
   OutputStream outStream = null;  
    try {
          OracleRuntime.setMaxMemorySize(rtMaxMSSet);
          
          int bufferSize = blob.getBufferSize();
          inStream  = blob.getBinaryStream();
          outStream = new FileOutputStream(file, true);
          
          byte[] buffer = new byte[bufferSize];
          int length = -1;
          while ((length = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, length);
          }
    } catch (Throwable t) {
          ExceptHelper.setError(t);
    } finally {
           if (inStream != null) {
                 inStream.close();
           }
           if (outStream != null) {
                 outStream.close();
           }
    }
  }
  
  static class ExceptHelper {
    static void setError(Throwable t) {
      try {
          byte[] errReport = new byte[400000];  
      } finally {
          throw new WrapException("\n --- Message ---\n" + t.getMessage());
      } 
    }
  }
  
  static class WrapException extends RuntimeException {
    WrapException(String msg) {
      super(msg);
    }
  }
}
/

Monday, July 30, 2012

One Mutex Collision Test

Inspired by the recent Higgs boson-hunting experiments at CERN, this Blog makes a Mutex collision test on Oracle 11.2.0.3.0 and watches what new "particles" can be detected.

We start the test by (see appended Test Code):

begin
  ksun_mutex_test_jobs(64, 2);
end;

which launches 64 sessions running ksun_mutex_testp1 and 2 sessions running ksun_mutex_testp2.

ksun_mutex_testp1 runs the same sql statement by varying the NLS settings.

ksun_mutex_testp2 runs a query over v$sql in order to simulate monitoring applications on sql statements. v$sql is also chosen because it is derived from x$kglcursor_child, which means any select on it will touch all child cursors.

ksun_mutex_testp2 accelerates the collision, but does not play a decisive role in the test.


Here are the observed "particles":

1. Number of child_cursors


In theory, ksun_mutex_testp1 generates at most 800 (40 nls_language values multiplied by 20 nls_territory values) child cursors for the statement (sql_id: '754r1k9db5u80'):
    select id into l_id from testt where name = :B1;
but
   select count(*) from v$sql where sql_id ='754r1k9db5u80';
   >>> 66'510
 (with certain fluctuations, but more than 10'000)


10Jan2013 Update: Tested on new Oracle 11.2.0.3.0,

select count(*), min(child_number), max(child_number) from v$sql where sql_id ='754r1k9db5u80';
=> 23'027  0  99

count(*) can no longer reach 66'510.

select count(*), min(child_number), max(child_number) from v$sql where sql_id ='754r1k9db5u80' and is_obsolete = 'N';

=> 78  0  77
It seems that "_cursor_obsolete_threshold" is active.      

 

2. Shared memory

select sharable_mem, persistent_mem, runtime_mem from v$sqlarea where sql_id ='754r1k9db5u80';

    >>> 845'643'351  292'590'880  226'019'744

3. Reason for child_cursors

 

select extract (xmltype ('<kroot>' || reason || '</kroot>'), '//ChildNumber').getStringVal ()   child
      ,extract (xmltype ('<kroot>' || reason || '</kroot>'), '//reason').getStringVal ()        reason
  from v$sql_shared_cursor
 where sql_id ='754r1k9db5u80';

>>> <ChildNumber>x</ChildNumber>    <reason>NLS Settings(0)</reason>

where x is the child_cursor number.

4. New MUTEX_TYPE: "hash table"


select * from v$mutex_sleep where mutex_type = 'hash table' order by sleeps;

    >>> hash table  kkshGetNextChild [KKSHBKLOC1]  1289912  0

I wonder if this is related to db_block_hash_buckets subpool in shared pool:

select * from v$sgastat where pool = 'shared pool' and name = 'db_block_hash_buckets';
   >>> shared pool  db_block_hash_buckets  2920448

Update on 18Mar2013:
 db_block_hash_buckets is for database block hash buckets and is allocated in the shared pool.
 It takes about 1% of db_cache_size for db_block_size = 8192, i.e. about 70 bytes for each database block hash bucket (chain).
 For example, for a DB with db_cache_size = 262GB, db_block_hash_buckets needs 2848MB in the shared pool.
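Based on the ~70-bytes-per-bucket observation above, the footprint for the current instance can be roughly estimated with a sketch like this (it assumes one bucket per buffer and an explicitly set db_cache_size):

```sql
select round(c.value / b.value * 70 / 1024 / 1024) estimated_mb
  from (select value from v$parameter where name = 'db_cache_size') c,
       (select value from v$parameter where name = 'db_block_size') b;
```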


5. System frozen

The most astonishing finding is that, after dozens of hours or even a couple of days, all started sessions are blocked by one single session, which is waiting either on the event:

       "library cache: mutex X"
or
    "library cache lock".

The waiters can be waiting on either "library cache: mutex X" or "library cache lock".

By dumping the systemstate for the case where the final blocker causes 'library cache lock' waits, we can see:

sid: 446 ser: 2025
    Current Wait Stack:
      Not in wait; last wait ended 1118 min 2 sec ago
    There are 65 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 42, ser: 1249
      wait event: 'library cache lock'
        p1: 'handle address'=0xde5d11a8
        p2: 'lock address'=0xe019ed68
        p3: '100*mode+namespace'=0x520002

sid: 42 ser: 1249
    Current Wait Stack:
     0: waiting for 'library cache lock'
        handle address=0xde5d11a8, lock address=0xe019ed68, 100*mode+namespace=0x520002
        wait_id=78226 seq_num=20283 snap_id=347
        wait times: snap=51.136124 sec, exc=1118 min 2 sec, total=1118 min 2 sec
        wait times: max=infinite, heur=1118 min 2 sec
        wait counts: calls=22362 os=22362
        in_wait=1 iflags=0x15a2
    There is at least one session blocking this session.
      Dumping 1 direct blocker(s):
        inst: 1, sid: 446, ser: 2025
 
v$lock shows that the final blocker, session 446, holds 3 locks of type CU, AE and JQ, and does not request any lock. A further check shows that only this session holds a CU lock. CU means "Recovers cursors in case of death while compiling", so this final blocker is probably trapped in an endless loop of the cursor-rescuing process.
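The locks held by the suspected final blocker can be listed with a query like this sketch (446 is the sid taken from the systemstate dump above):

```sql
select type, id1, id2, lmode, request, ctime
  from v$lock
 where sid = 446
 order by type;
-- expected: CU, AE, JQ held (lmode > 0) and nothing requested (request = 0)
```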

If one tries to unblock the situation by killing the blocking session, one of the blocked sessions takes over its role, and the original blocking session joins the blocked sessions. So the only fix could be to reboot the DB.

At the UNIX level, we can see that this process spends more than 95% in user CPU.

As checked from time to time, the system gets gradually slower and finally hangs completely on one single session.

The tests were carried out on Solaris (x86-64) and AIX. Similar blocking behaviors were also reported by some real Oracle 11gR2 applications.

Probably these "particles" will not need 45 years to be confirmed by Oracle.

By the way, with Oracle 11.2.0.3.0, a new parameter is introduced:

Name:              _cursor_obsolete_threshold      
Description:     Number of cursors per parent before obsoletion.             
Default value:  100

Addendum (2015.03.25): Looking at Oracle 11.2.0.4.0, the default value is increased to 1024.

It seems Oracle does not strictly enforce this threshold, which can eventually cause a Shared Pool explosion due to the number of child cursor versions.



Test Code (Oracle 11.2.0.3.0):


drop table testt;

create table testt as select 1 id, 'ksun' name from dual;

create table ksun_mutex_sleep as select * from v$mutex_sleep where 1=2;

create or replace procedure ksun_mutex_testp1 as
  l_id       number;
  l_name     varchar2(10) := 'ksun';
  type       tab is table of varchar2(100);
  l_tab_lang tab;
  l_tab_terr tab;
begin
  select value bulk collect into l_tab_lang from v$nls_valid_values where parameter='LANGUAGE';
  select value bulk collect into l_tab_terr from v$nls_valid_values where parameter='TERRITORY';
 
  for j in 1..least(40, l_tab_lang.count) loop
    --execute immediate q'[alter session set nls_language=']'||l_tab_lang(j)||q'[']';
    dbms_session.set_nls('nls_language', ''''||l_tab_lang(j)||'''');
    for k in 1..least(20, l_tab_terr.count) loop
      --execute immediate q'[alter session set nls_territory=']'||l_tab_terr(k)||q'[']';
      dbms_session.set_nls('nls_territory', ''''||l_tab_terr(k)||'''');
      select id into l_id from testt where name = l_name;
    end loop;
  end loop;
end;
/

create or replace procedure ksun_mutex_testp2 as
begin
  for c in (select * from v$sql) loop
    null;
  end loop;
end;
/

create or replace procedure ksun_mutex_test_jobs(p_job_cnt1 number, p_job_cnt2 number)
as
   l_job_id pls_integer;
begin
    for i in 1.. p_job_cnt1 loop
      dbms_job.submit(l_job_id, 'ksun_mutex_testp1;', interval => 'sysdate');
    end loop;
   
    for i in 1.. p_job_cnt2 loop
      dbms_job.submit(l_job_id, 'ksun_mutex_testp2;', interval => 'sysdate');
    end loop;
    commit;
end;   
/


Monday, May 7, 2012

Why shared pool is doubled ?

Problem:


There is an Oracle application which divides the whole work into a front-end and a back-end. Both perform similar tasks. The front-end does GUI online processing of small and urgent work,
whereas the back-end handles the large, normal cases via Oracle background Jobs.

The system is set up with the default value WORKAREA_SIZE_POLICY = AUTO. From time to time, the back-end hits ORA-04030, i.e. operating system process private memory has been exhausted. So the developers decided they needed better control of the *_AREA_SIZE parameters; for example, by regulating HASH_AREA_SIZE and SORT_AREA_SIZE, they could somehow avoid this error. Hence, they decided to put:

    execute immediate 'alter session set workarea_size_policy = manual';

at the beginning of background Jobs (see Test Code).

For a while, the system was stable and no more ORA-04030 occurred.

As development progressed and the workload grew, the application got bigger and bigger (more code, too). As a consequence, the DBA was obliged to increase the shared pool memory with each rollout of a new release; at the moment it is more than 12 GB. However, performance degraded continuously, probably due to the huge cursor cache and library cache, which inflate the time for managing and looking up cursors.

By dumping library_cache with:

    alter session set events 'immediate trace name library_cache level 15';

One can see frequent CursorDiagnosticsNodes and AgedOutCursorDiagnosticNodes marked as

    reason=Optimizer mismatch(12) size=2x216
      workarea_size_policy= manual auto
      _smm_auto_cost_enabled= false true


which causes enormous cursor reloads and invalidations.

Even worse, there are SGA memory resize operations, which means the Shared Pool steals memory from the Buffer Cache.

One can check this resizing by (or check AWR - Memory Dynamic Components):

select component, current_size, user_specified_size, oper_count, last_oper_type
from v$memory_dynamic_components;


The output looks like:

COMPONENT             CURRENT_SIZE   USER_SPECIFIED_SIZE  OPER_COUNT  LAST_OPER_TYPE
shared pool           2'785'017'856  2'415'919'104        5           GROW
DEFAULT buffer cache  2'936'012'800  3'305'111'552        5           SHRINK


The above shows the same amount (369'098'752 bytes) stolen by the shared pool from the DEFAULT buffer cache (v$sga_resize_ops and v$memory_resize_ops provide similar information).

During such an operation, the entire system looks jammed. This seems well known to Oracle, who designated a new hidden parameter: "_memory_imm_mode_without_autosga" (allow immediate mode without sga/memory target) to prevent such resizing.

MOS also made a clear claim:

set the parameter _MEMORY_IMM_MODE_WITHOUT_AUTOSGA=false (default is TRUE) in the instance to disable this feature with the consequence that in future an ORA-4031 error would be raised.

    alter system set "_memory_imm_mode_without_autosga"=FALSE;

That is indeed the case: since then, the system occasionally hits ORA-04031.

Analysis:


With Oracle 11g Release 2 (11.2.0.2), a new column REASON is available in v$sql_shared_cursor to reveal the root cause of the child cursor boost.

select sql_id, optimizer_mismatch
      ,extract (xmltype ('<kroot>' || reason || '</kroot>'), '//workarea_size_policy').getStringVal ()   wasp
      ,extract (xmltype ('<kroot>' || reason || '</kroot>'), '//_smm_auto_cost_enabled').getStringVal () sace
      ,v.*
from v$sql_shared_cursor v
where sql_id in (select sql_id from v$sql_shared_cursor where optimizer_mismatch='Y')
order by v.sql_id, wasp, sace;


It shows that there are paired child cursors for each sql_id due to optimizer_mismatch,
and the reason is workarea_size_policy.

00hxpd7rjygxq       N
  <workarea_size_policy>   auto manual </workarea_size_policy>   
  <_smm_auto_cost_enabled> true false  </_smm_auto_cost_enabled>
00hxpd7rjygxq       Y
  <workarea_size_policy>   manual auto </workarea_size_policy>   
  <_smm_auto_cost_enabled> false  true </_smm_auto_cost_enabled>


Note that a top node <kroot> is specially appended, since the reason column can contain more than one <ChildNode> and XPath requires a single top node to group them (confirming again that new features need a long way to go before becoming perfect).

This Blog demonstrates that a tiny unnoticed change can provoke a dramatic impact.

Fix:


Remove:
    'alter session set workarea_size_policy = manual' 
and use dynamic code to adjust the back-end jobs.

If we check "SQL AREA" again, we find that memory consumption is only half of the original:

    select bytes from v$sgastat where name = 'SQLA';

Test Code (Oracle 11.2.0.2 or later):


create table testt as
select level x, rpad ('abc', 100, 'k') y from dual connect by level <= 1000;


create or replace procedure tasks as
  y   number;
begin
  for j in 1 .. 500 loop
    for i in 1 .. 1000 loop
      execute immediate 'select /*' || (rpad ('x', i, 'y')) || '*/ x from testt where x = :i'
      into y using i;
    end loop;
  end loop;
end;
/

create or replace procedure auto_run as
begin
  execute immediate 'alter session set workarea_size_policy = auto';
  tasks;
end;
/

create or replace procedure manual_run as
begin
  execute immediate 'alter session set workarea_size_policy = manual';
  tasks;
end;
/

create or replace procedure start_work (p_job_cnt number) as
  l_job_id   pls_integer;
begin
  for i in 1 .. p_job_cnt loop
      dbms_job.submit (l_job_id, 'begin auto_run; end;');
      dbms_job.submit (l_job_id, 'begin manual_run; end;');
  end loop;
  commit;
end;
/

exec start_work(32);

Tuesday, April 24, 2012

High system time utilization on fast machine

Problem:


Recently we hit a performance problem on a system running a heavy Oracle application. Occasionally, when more than 100 "end of day" batch jobs are automatically submitted by a job control system in the early morning, no new users can log in to Oracle (the alert log shows many lines of ORA-609, that is, could not attach to incoming connection).

Concentrating only on Oracle performance did not reveal any particular clue: the already logged-in users and applications could still work, and the Oracle workload was not high.

Then we asked the UNIX administrator to send us some snapshots of vmstat (an AIX NMON report is even better); it immediately turned out that run queues, kernel thread context switches and system CPU utilization were extremely high compared to normal operation.

Here are two entries of vmstat, one in normal time, the other during batch (problem) time.

kthr  faults         cpu        time
----- ------------  ---------  --------
  r       sy     cs  us sy id   hr mi se
 37   250806   7412  34  2 64   09:53:39  << normal time
185    87914  34261  12 80  8   02:52:44  << batch  time

At the end of this Blog, complete simplified test code is appended, which can be tested on AIX and Solaris.

Analysis:


The application runs on a high-end IBM AIX server with many cores (physical CPUs).
When 100 batch jobs are submitted, most of them can be scheduled to the "Running" state;
however, there is a single config file to be read, which inherently creates a single point of contention (bottleneck). IO is normally not as fast as CPU: the faster the CPU, the higher the contention.

Here is what the AIX default scheduler (SCHED_OTHER) does. When a "New" process enters the system, it is first put into the "Ready" state. The scheduler then brings it to "Running"; in this case it has to wait for IO, so it switches to the "Preempted" state, later goes back to "Ready", and eventually waits for the scheduler to bring it into "Running" again, yielding the CPU to other processes in the meantime. Since all these processes have the same characteristics, they go through many such life cycles. The system therefore ends up spending the majority of its time on frequent kernel thread context switches instead of application work. This "thrashing" causes high system CPU utilization because all jobs compete for the single file.

The problem occurs only when the system base load (the load at the moment the batch jobs are launched) reaches a certain level.

As tools, vmstat and sar are used for CPU and filemon for file IO; on AIX, an nmon report provides much richer information. Using "ps -efl" (or the Berkeley-style "ps glew"), one can see many <defunct> and <exiting> processes (more than 50 of each) during the problem time. This is because batchload forks loadconf into the background and exits before its child processes terminate, so the children become orphans.

When monitoring CPU performance, it is better to check CPU utilization rather than the run queue: utilization is an average over an interval, whereas the run queue is a point-in-time sample (for example in vmstat), normally taken every 100 microseconds, while today's CPUs tick in nanoseconds, so the run-queue value cannot be very precise. For instance, submitting 100 jobs does not mean the run queue will show 100. In fact, the Nyquist sampling theorem requires sampling at least twice as fast as the fastest change being measured.
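To illustrate why an averaged figure is steadier than a single point-in-time reading, here is a small sketch (my own, not from the original test) that averages the run-queue column over a batch of vmstat-style samples; it assumes the run queue is the first numeric column, as in the extract above.

```shell
#!/bin/ksh
#    usage:  avg_runq < vmstat_output
# Average the run-queue column over all samples: the mean over an
# interval is a steadier signal than any single sampled value.
# Assumes the run queue is column 1, as in the extract above.
avg_runq() {
  awk '$1 ~ /^[0-9]+$/ { sum += $1; n++ }
       END { if (n) printf "%d samples, avg runq %.1f\n", n, sum / n }'
}

# demo: three synthetic samples
printf '37 x\n185 x\n90 x\n' | avg_runq     # prints: 3 samples, avg runq 104.0
```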

On Solaris, the paper (http://www.solarisinternals.com/wiki/images/2/28/Technocrat-util.pdf) has a profound discussion of run queues and demonstrates how to track the insertion of runnable threads onto CPU run queues with a DTrace script. One thing I noticed is that "System calls" in vmstat does not necessarily coincide with run queues and context switches; that is still not clear to me.
This Blog also shows that finding the root cause is paramount in Oracle performance troubleshooting.

Fix:


A quick fix is to insert a sleep after submitting each job, so that the CPU's speed advantage is throttled and the IO burden is reduced.

Test Code:

  1.  Script-1 creates a config file, which looks like a Java properties file.
  2.  Script-2 reads the config file.
  3.  Script-3 launches N batch jobs in the background.
The test can be performed by calling, in one UNIX session:
      batchload 100
and at the same time running vmstat in a second session.

Script-1 confcreate

#!/bin/ksh
#    usage:  confcreate
count=0
max=1000
configfile="conf"

while [[ $count -lt $max ]]
do
  print "name$count=val$count" >> $configfile
  count=$((count + 1))
done

Script-2 loadconf

#!/bin/ksh
#    usage:  loadconf
count=0
max=1000
configfile="conf"

while [[ $count -lt $max ]]
do
  # fork awk to extract the value for key "name$count";
  # note each call rescans the whole config file
  val=$(awk -F"=" "/^(name$count)=/"'{print $2}' "$configfile")
  count=$((count + 1))
  # some sqlplus script to do the business
done

Script-3 batchload


#!/bin/ksh
#    usage:   . ./batchload 100
count=0
max=$1

while [[ $count -lt $max ]]
do
 ./loadconf &
 print batchload $count at $(date)
#  sleep 2                            # sleep to yield CPU
  count=$((count + 1))
done

Monday, February 20, 2012

Is the latch misses statistic gathered or deduced?



Correction (2017-Sep-23):


One very experienced Oracle specialist pointed out an error in this Blog. The formula given by the book Oracle Core: Essential Internals for DBAs and Developers (pages 72-73) is:
       sleeps + spin_gets - misses
     = recurrent_sleeps
The Blog "Latch, mutex and beyond" presents deep research on the above formula.



If you run the first query below and see that the majority of rows have a "delta_perc" column of less than 1 (1 percent), continue reading this Blog.

select name, misses, sleeps, spin_gets
      ,sign((sleeps + spin_gets) - misses) miss_deduced_sign
      ,((sleeps + spin_gets) - misses)     delta
      ,trunc(abs((sleeps + spin_gets) - misses) / nullif(misses, 0) * 100, 2) delta_perc
from v$latch where misses > 1000
order by misses desc;

This Blog tries to show that the latch misses statistic is deduced, not gathered, in the Oracle kernel code.

Observing the very small "delta_perc" values, I assume that the equation:
    misses = sleeps + spin_gets
approximately holds for most latches.

If this is the case, Oracle would not record all three statistics, sacrificing the high-performance latch algorithm, just to verify the above equation.

Further, since "miss_deduced_sign" is mostly 0 or 1 (positive), the "misses" statistic is probably deduced from (sleeps + spin_gets) rather than gathered in the kernel code (which is efficiently implemented by the Oracle developers). Oracle probably records only "sleeps" and "spin_gets", because both signify that a "get" did not succeed and will take longer anyway, so the time to record both is less expensive than the "get" itself (the time to record both statistics is negligible; a disciplined programmer would not let auxiliary actions take more time than the real actions). The occasional "miss_deduced_sign" of 1 could be explained by update latency of "misses" relative to (sleeps + spin_gets). As observed, the occasional "miss_deduced_sign" of -1 or 1 appears when the system is heavily loaded; once the system is stable, "miss_deduced_sign" tends to return to 0. This is also a good optimization: function first, statistics later.
   
A large negative "delta" (miss_deduced_sign = -1) could be a residual from DB startup, since such values stay almost constant.

Some books contain pseudo-code explaining the Oracle latch algorithm; I hope it matches the aforementioned equation (see the book Expert Oracle Database Architecture, page 222, and Oracle Performance Firefighting, page 79).