Friday, April 12, 2013

cursor: pin S wait on X

Oracle introduced mutex (replacing latch) to protect cursor and library cache, and consequently following wait events:

   cursor: mutex S
   cursor: mutex X

   cursor: pin S
   cursor: pin X
   cursor: pin S wait on X

   library cache: mutex S
   library cache: mutex X


Staring in perplexity, I wonder why Oracle conceived such an asymmetrical third wait event for "cursor: pin":

      cursor: pin S wait on X

but not for "cursor: mutex" and "library cache: mutex".

A recent session crash dump shed some light on this deliberate design.

Session Wait History shows that it is desperately waiting for 'cursor: pin S wait on X'.

  waited for 'cursor: pin S wait on X'                               
  idn=0xb9a7f11c, value=0x1fef00000000, where=0x500000000
  wait_id=26539 seq_num=26582 snap_id=1
  wait times: snap=0.010060 sec, exc=0.010060 sec, total=0.010060 sec
  wait times: max=infinite
  wait counts: calls=2 os=2
  occurred after 0.014483 sec of elapsed time


pick the idn value, and search further in the dump file, we find the waited object.

  LibraryHandle:  Address=70000521ff2b1e8 Hash=b9a7f11c LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
   ObjectName:  Name=SELECT NAME FROM
     FullHashValue=28a0bd52de76008db45a3f29b9a7f11c Namespace=SQL AREA(00) Type=CURSOR(00)
     Children: 
       Child:  childNum='0'
         LibraryHandle:  Address=70000521f025fc0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
           Name:  Namespace=SQL AREA(00) Type=CURSOR(00)
       Child:  childNum='1'
         LibraryHandle:  Address=70000521d0eeeb0 LockMode=N PinMode=X LoadLockMode=0 Status=VALD
           Name:  Namespace=SQL AREA(00) Type=CURSOR(00)
       Child:  childNum='2'
         LibraryHandle:  Address=70000521dec3790 LockMode=0 PinMode=0 LoadLockMode=0 Status=VALD
           Name:  Namespace=SQL AREA(00) Type=CURSOR(00)
           ObjectFreed=last freed from HPD addn data CBK
       Child:  childNum='3'
         LibraryHandle:  Address=70000521adcce68 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
           Name:  Namespace=SQL AREA(00) Type=CURSOR(00)

          
A query on dba_hist_active_sess_history reveals that only this session is with SQL_CHILD_NUMBER = 2, all other sessions are with SQL_CHILD_NUMBER = 1. Before and after this session crashes(exit), other sessions are still running.

Look carefully at childNum='1', we can see this child cursor marked with:

    PinMode=X
          
Now a possible interpretation for 'cursor: pin S wait on X' would be:

   When Oracle wants to "pin S" a cursor, it sequentially traverses all child cursors under the same parent.
   If it hits a cursor marked with: PinMode=X before it reaches the matched cursor, it posts:

        cursor: pin S wait on X


1. Stack Trace


Pick UNIX pid of blocked Oracle sessions waiting on Event: "cursor: pin S wait on X", make a stack trace.
Here the outputs on both AIX and Solaris, where the line:
    kkscsSearchChildList
probably indicates the sequential examination along all child cursors.

oracle@aix > procstack 64815340
 64815340: oracledb (LOCAL=NO)
 __fd_select(??, ??, ??, ??, ??) + 0xbc
 select(??, ??, ??, ??, ??) + 0x14
 skgpnap(??, ??, ??, ??) + 0x16c
 skgpwwait(??, ??, ??, ??, ??) + 0x204
 kgxWait(??, ??, ??, ??, ??, ??) + 0x1b8
 kgxModifyRefCount(??, ??, ??, ??) + 0x288
 kgxSharedExamine(??, ??, ??) + 0x3c
 kxsGetRuntimeLock(??, ??, ??, ??, ??) + 0x144
 kkscsCheckCursor(??, ??, ??, ??, ??, ??, ??, ??) + 0x1ec
 kkscsSearchChildList(??, ??, ??, ??, ??, ??, ??, ??) + 0x7b8
 kksfbc(??, ??, ??, ??, ??, ??, ??, ??) + 0x51f4
 kkspsc0(??, ??, ??, ??, ??, ??, ??) + 0x8b0
 kksParseCursor(??) + 0x78
 opiosq0(??, ??, ??, ??, ??) + 0x7f4
 kpooprx(0x7000100fb4086a0, 0x0, 0x0, 0x100000001, 0x0, 0xa4000000a4, 0x0) + 0x16c
 kpoal8(??, ??, ??) + 0x1bc
 opiodr(??, ??, ??, ??) + 0x3b4
 ttcpip(??, ??, ??, ??, ??, ??, ??, ??) + 0x34c
 opitsk(??, ??) + 0x2224
 opiino(??, ??, ??) + 0xb98
 opiodr(??, ??, ??, ??) + 0x3b4
 opidrv(??, ??, ??) + 0x414
 sou2o(??, ??, ??, ??) + 0xf0
 opimai_real(??, ??) + 0x1c0
 ssthrdmain(0x0, 0x200000002, 0xfffffffffffed18) + 0x1c4
 main(??, ??) + 0xd0
 __start() + 0x70


oracle@slaris :/export/home/oracle > pstack 1098
 pollsys  (fffffd7fffdf5860, 0, fffffd7fffdf58f0, 0)
 pselect () + 154
 select () + 72
 skgpwwait () + 168
 kgxWait () + 216
 kgxModifyRefCount () + 15e
 kgxSharedExamine () + 3b
 kxsGetRuntimeLock () + 137
 kkscsCheckCursor () + 22a
 kkscsSearchChildList () + 58f
 kksfbc () + 915
 kkspsc0 () + 9d7
 kksParseCursor () + 74
 opiosq0 () + 70a
 kpooprx () + 102
 kpoal8 () + 308
 opiodr () + 433
 ttcpip () + 593
 opitsk () + 6cc
 opiino () + 3c5
 opiodr () + 433
 opidrv () + 33b
 sou2o () + 7a
 opimai_real () + 265
 ssthrdmain () + 22e
 main () + a4
 ???????? ()

2. Mutex Sleep Reason


We have a SQL statement, whose parsing takes about 11 minutes. If we start two sessions to parse the same statement. One session is parsing, another is waiting for Event: cursor: pin S wait on X.

Collect AWR report for the duration of 11 minutes, Section: Mutex Sleep Summary shows:

Mutex Type Location Sleeps Wait Time (ms)
Cursor Pin kkslce [KKSCHLPIN2] 61,421 616,083
Library Cache kglhdgn2 106 1 0
Library Cache kglpnal1 90 1 0

We can see the 11 minutes of 616,083 Wait Time (ms), Sleeps of 61,421.
That is probably due to Scheduler clock tick frequency of 10 milliseconds, approximately the equation:
    61,421 x 10 = 616,083

The above observation seems conforming to MOS (Doc ID 1268724.1):
   "Cursor: Pin S Wait On X" Contention Mutex Sleep Reason Primarily ' kkslce [KKSCHLPIN2]'


3. cursor: pin S wait on X    Simulation


Setup test:

drop table testt;

create table testt as select 1 id, 'Test' name from dual;

Open a Sqlplus Session,

SQL (72) > select sid, pid, spid from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;

 SID  PID  SPID
---- ---- -----
  72   34  1122

SQL (72) > alter system flush shared_pool; 

Pick one kks subroutine, for instance, kksFullTypeCheck. On Solaris, execute:

$ sudo dtrace -w -n 'pid$target::kksFullTypeCheck:entry {@[pid, ustack(10, 0)] = count(); stop(); exit(0);}' -p 1122  

On above Sqlplus Session, run:

SQL (72) > select id, name from testt where id = 1;

The Session is stopped.

Open a second Sqlplus Session, and run the same query:

SQL (83) > select sid, pid, spid from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;

 SID  PID  SPID
---- ---- -----
 83    45  6677

SQL (83) > select id, name from testt where id = 1;

SID 83 is blocked.

Observe the "cursor: pin S wait on X" event by:

select sid, osid, chain_signature, blocker_is_valid, wait_event_text, in_wait_secs, num_waiters
from v$wait_chains where sid in (72, 83);

To unblock, execute:
  $ prun 1122


4. Slow Parsing in multiple Sessions


Further tests show that if a slow parsing takes x seconds for one statement in one single session, concurrently parsing it in n sessions will take a total elapsed time of (n*(n-1)*x + x) seconds for all sessions, even worse if CPU is saturated. So in average, that is about (n-1)*x seconds per session (assume n is relatively big). The main wait event is "cursor: pin S wait on X" among the concurrent sessions.

One workaround to get off the racing condition is to make the statement passed to each session semantic equivalent, but syntax (sql text) different, for example, adding different dummy hint, or appending different constant to "order by" clause in each session's statement. The downside is more space in shared pool.

Shared Pool - Free Memory, Fragmentation ?

Facing:
    ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","SELECT MAX(XX) FROM...","SQLA","tmp")

under shared pool memory distribution of:

Top 5 Areas Name Subpool_1 Subpool_2 Subpool_3 Subpool_4 Subpool_5 Subpool_6 Subpool_7 Sum
KKSSP 617'693 563'340 590'296 590'798 577'532 643'504 577'237 4'160'404
db_block_hash_buckets 529'039 535'429 529'039 539'525 535'429 529'044 535'429 3'732'935
KGLH0 554'532 485'312 464'006 353'634 450'528 346'045 332'455 2'986'514
SQLA 1'110 588'943 565'317 185'664 574'008 155'498 272'452 2'342'994
free memory 306'562 358'029 353'146 306'659 342'194 325'386 304'535 2'296'513

(see Blog: Shared Pool - Memory Allocation - unbalanced)

an intuitively skeptical question is:
               Why do I get ORA-04031 even though there is plenty of free memory (> 10%) ?
One unsubstantial reply is mostly: memory fragmentation (or memory leak).

This Block is trying to disperse such fashionable pretext.

A simple way to dispute it is to make a heapdump and you will see FREE LISTS starting with size=32 in Bucket 0:
   FREE LISTS:
   Bucket 0 size=32

so the minimum memory chunk size in shared pool is 32 Byte, any free memory chunk is at least 32 Byte.

But above error says:
   unable to allocate 32 bytes
evidently there is truly no more free memory.

So where is the secrecy behind the controversial information ?

To understand it, firstly we will look at what free memory is in different aspect, then expose the memory allocation of certain popular areas (subheaps), and finally debate further on this theme.

1. Free Memory


Take a DB of shared_pool_size = 1408M with two Subpools, one direct way to get free memory statistics is:
   select round(bytes/1024/1024) mb from v$sgastat where pool='shared pool' and name = 'free memory';
returns:
 298
 
or on its internal x$ table:
  select ksmdsidx, round(ksmsslen/1024/1024) MB from x$ksmss where ksmssnam ='free memory';

returns:
  ksmdsidx  MB
 -------------
 0         208
 1         42
 2         48

which details free memory per subpool, where ksmdsidx 0 denotes the "RESERVED EXTENTS" in "sga heap" top.

A heapdump shows:

 HEAP DUMP heap name="sga heap"  desc=700000000000198
    reserved granule count 13 (granule size 16777216)
  RESERVED EXTENTS
 HEAP DUMP heap name="sga heap(1,0)"  desc=700000000052a48
  FREE LISTS
   Total free space   =  9970088
  RESERVED FREE LISTS
   Total reserved free space   = 29473232
  
 HEAP DUMP heap name="sga heap(2,0)"  desc=70000000005c310
  FREE LISTS
   Total free space   = 12790056
  RESERVED FREE LISTS
   Total reserved free space   = 31900768

ksmdsidx 0 is correlated well with heapdump, but not free memory in both subpools.

 round(16777216*13/1024/1024)         = 208
 round((9970088+29473232)/1024/1024)  =  38
 round((12790056+31900768)/1024/1024) =  43

(In heapdump, summing all chunks commented with "R-free" and "free" gives the same result)

So now the question is why v$sgastat reports more free memory than heapdump ?

Let's try to find out it.

2. Subheaps: KGLH0, KGLS, PLDIA, SQLA, SQLK, SQLP


These subheaps are chosen since they are mostly allocated in a chunk size of 4096, and all together play a prevalent role in memory consumption (On the top of Page 157 of Book: Expert Oracle Database Architecture, there is one line:
  The shared pool is characterized by lots of small (generally 4KB or less) chunks of memory).

At first select them from v$sgastat:

 select name, round(bytes/1024/1024) mb
   from v$sgastat
  where pool='shared pool'
    and name in ('KGLH0', 'KGLS', 'PLDIA', 'SQLA', 'SQLK', 'SQLP')
 order by mb desc;

 NAME   MB
 -----------
 SQLA   219
 KGLH0  125
 KGLS   43
 PLDIA  9
 SQLP   5
 SQLK   0

another query on x$ksmsp displays more details about memory usage:

 select substr(ksmchcom, 1, 5)          name
       ,round(sum(ksmchsiz)/1024/1024) mb
       ,count(ksmchsiz)                cnt
       ,round(avg(ksmchsiz))           avg
       ,min(ksmchsiz)                  min
       ,max(ksmchsiz)                  max
   from x$ksmsp v
 where substr(ksmchcom, 1, 5) in ('KGLH0', 'KGLS^', 'PLDIA', 'SQLA^', 'SQLK^', 'SQLP^')
 group by substr(ksmchcom, 1, 5)
 order by mb desc;

 NAME   MB   CNT   AVG  MIN  MAX 
 ---------------------------------
 SQLA^  220 56269 4099 4096 28664
 KGLH0  126 32261 4096 4096 4640
 KGLS^  43  11005 4096 4096 4128
 PLDIA  9   2195  4096 4096 4112
 SQLP^  5   1209  4168 4096 5888
 SQLK^  0   19    4096 4096 4096

Look again on the same heapdump, extract all lines containing one special comment, say,

KGLH0^d020e92f
  Chunk  7000000a1cd3578 sz=     4096    recreate  "KGLH0^d020e92f "  latch=0
  Chunk  7000000a1d9c4d8 sz=     4096    recreate  "KGLH0^d020e92f "  latch=0
  Chunk  7000000a1efce68 sz=     4096    freeable  "KGLH0^d020e92f "  ds=7000000a1d9c450
  Chunk  7000000a7b8f880 sz=     4096    freeable  "KGLH0^d020e92f "  ds=7000000a1d9c450
  Chunk  7000000a7f07588 sz=     4096    freeable  "KGLH0^d020e92f "  ds=7000000a1d9c450
 
and then make an addr dump of KGLH0^d020e92f:  

 Processing Oradebug command 'dump heapdump_addr 1 0X7000000A1D9C450'
 HEAP DUMP heap name="KGLH0^d020e92f"  desc=7000000a1d9c450
 EXTENT 0 addr=7000000a1efce80
   Chunk  7000000a1efce90 sz=     1384    perm      "perm           "  alo=600
   Chunk  7000000a1efd3f8 sz=      152    freeable  "kgltbtab       "
   Chunk  7000000a1efd490 sz=      696    freeable  "policy chain   "
   Chunk  7000000a1efd748 sz=      144    freeable  "context chain  "
   ...
 EXTENT 1 addr=7000000a7f075a0
   Chunk  7000000a7f075b0 sz=      376    free      "               "
   ...
 EXTENT 2 addr=7000000a7b8f898
   Chunk  7000000a7b8f8a8 sz=     1760    perm      "perm           "  alo=1760
   Chunk  7000000a7b8ff88 sz=      416    free      "               "
   ...
 EXTENT 3 addr=7000000a1cd35a8
   Chunk  7000000a1cd35b8 sz=       80    perm      "perm           "  alo=80
   Chunk  7000000a1cd3608 sz=     3296    perm      "perm           "  alo=3296
   Chunk  7000000a1cd42e8 sz=       48    free      "               "
   ...
 Total heap size    =    16200
 FREE LISTS:
  Bucket 0 size=0
   Chunk  7000000a7b8ff88 sz=      416    free      "               "
   Chunk  7000000a7f075b0 sz=      376    free      "               "
   Chunk  7000000a1cd42e8 sz=       48    free      "               "
   Chunk  7000000a1cd35d8 sz=        0    kghdsx
 Total free space   =      840
 UNPINNED RECREATABLE CHUNKS (lru first):
 PERMANENT CHUNKS:
   Chunk  7000000a1efce90 sz=     1384    perm      "perm           "  alo=600
   ...
 Permanent space    =     6520

so total used memory is 16200 (including 840 free space, and 6520 Permanent space), however the allocated space is 5*4096 = 20480, so a 20480 - 16200 = 4280 over allocation.

Make a cross-check with the same heapdump, we see no free Chunks in Bucket 43(size=376) and 48(size=416):

HEAP DUMP heap name="sga heap(1,0)"  desc=700000000052a48
 FREE LISTS:
  Bucket 42 size=368
  Bucket 43 size=376
  Bucket 44 size=384
  Bucket 45 size=392
  Bucket 46 size=400
  Bucket 47 size=408
  Bucket 48 size=416
  Bucket 49 size=424
HEAP DUMP heap name="sga heap(2,0)"  desc=70000000005c310
 FREE LISTS:
  Bucket 42 size=368
  Bucket 43 size=376
  Bucket 44 size=384
  Bucket 45 size=392
  Bucket 46 size=400
  Bucket 47 size=408
   Chunk  7000000a3e8c3d8 sz=      408    free      "               "
  Bucket 48 size=416
  Bucket 49 size=424

 
so the FREE LISTS in addr dump are not counted in free memory of heapdump, that probably explains why the free memory of heapdump is less than that of v$sgastat. Since they are not listed in any FREE LISTS, they can not be allocated to any later memory demand till their bound parental chunks returned to LRU LIST (UNPINNED RECREATABLE CHUNKS). I guess that Oracle takes such an approach to reduce memory fragmentation (and coalescing) in hoping that each allocated chunk is fully used in these subheaps. In summary, free memory in v$sgastat probably displays more than allocable free memory since it includes un-allocable free memory inside the allocated chunks(chunk-intra free memory).

3. Discussion


3.1. session_cached_cursors: KGLH0 and SQLA


Each Oracle object, whatever SQL cursor(both parent and child), Table, PL/SQL, has one KGLH0 memory allocation of multiple 4K. When session_cached_cursors is set, repeated parse calls (more than 3 times) of the same SQL (including recursive SQL) or PL/SQL statement by any session connected to DB are candidate for addition to session cursor cache (Understanding Shared Pool Memory Structures ).

Cursor in this cache is "partially pinned", that means, KGLH0(heap 0) is pinned, but not SQLA (heap 6). So when session_cached_cursors is set high, there will be potentially huge KGLH0 pinned, but not SQLA.

As a certain rule of thumb, SQLA should be proportional to KGLH0, otherwise they are out of originally designed range.

In fact, the DB which throws ORA-04031 has set session_cached_cursors = 600 (Oracle Default=50).
(see Blog: Shared Pool - Memory Allocation - unbalanced) And the memory statistics shows that KGLH0 is the third top memory consumer in all subpools, and SQLA in subpool_1 is extremely problematic. With 6000 concurrently connected sessions and session_cached_cursors = 600, there will be 3'600'000 pinned KGLH0 in the extreme case (this pure math number will not appear since they are not all different).

Continuing with our experiment, a library_cache dump("level 16") confirms that there is always one KGLH0 for each cursor (parent and child), and occasionally one SQLA, so more KGLH0 subheaps than SQLA subheaps. Obviously this is already an optimization since normally SQLA requires more memory than its corresponding KGLH0.

Library_cache dump also shows certain information about memory usage, for example, a particular SQLA:

   Block:  #='6' name=Permanent space^580dee70 pins=0 Change=NONE
     Heap=7000000a0117ec8 Pointer=7000000a49c0040 Extent=7000000a49bf3e8 Flags=I/-/-/A/-/E
     FreedLocation=0 Alloc=23.171875 Size=23.742188 LoadTime=0

To find out what "Alloc" and "Size" denote, make an addr heapdump for Heap=7000000a0117ec8, then you will see following lines:

  dump heapdump_addr 1 0X7000000a0117ec8
    Total heap size    =    24312
    Total free space   =      504
    Permanent space    =       80

and then try to relate them together:

   Total heap size = Size  = 23.742188*1024 = 24312
   Total free space + Permanent space = 23.742188*1024 - 23.171875*1024 = 584 = 504 + 80
  
so "Size" seems about total allocated memory in Byte, but I wonder why "Permanent space" is excluded from "Alloc".

When applying the same calculation for KGLH0, it seems that only "Size" matches, but not "Alloc".
(Both names: "Alloc" and "Size" seems also reversely used)

In the dump file when memory of shard_pool being under pressure (ORA-04031), probably there are certain offending cursors with noticeable KGLH0 or SQLA memory Size, for example: KGLH0^ac7e9a16 consumes about 131'343'512 Bytes.

DataBlocks:  
  Block:  #='0' name=KGLH0^ac7e9a16 pins=0 Change=NONE   
    Heap=700014eb0a10b30 Pointer=7000150fedd92e0 Extent=7000150fedd9170 Flags=I/-/P/A/-/-/- 
    FreedLocation=0 Alloc=128010.093750 Size=128265.148438 LoadTime=16455656254 

3.2. v$sgastat(x$ksmss) vs. x$ksmsp


In v$sgastat, Bytes for "free memory" statistics is really allocable free memory (top RESERVED EXTENTS, and FREE LISTS, RESERVED FREE LISTS in each subpool) plus un-allocable chunk-intra free memory. And Bytes for each subheap is the effectively occupied memory.

Whereas x$ksmsp reports allocated memory, but in more details, for example, KGLH0 for each cursor. It also outputs one special row for "permanent memor". But it enumerates less rows than v$sgastat.

In conclusion, this chunk-intra free memory probably irritates such a confusion:
              I get ORA-04031 even though there is plenty of free memory (> 10%).

With following query, you can get a rough comparison of both aspects. By only running the subquery_factoring_clause, you can see that above mentioned subheaps:
    KGLH0, KGLS, PLDIA, SQLA, SQLK, SQLP
are mostly allocated in 4K, and there are also more KGLH0 than SQLA due to session_cached_cursors.

with x as
(
select substr(ksmchcom, 1, decode((instr(ksmchcom, '^') - 1), -1,
                           length(ksmchcom), (instr(ksmchcom, '^') - 1))) name
      ,sum(ksmchsiz) bytes
      ,round(sum(ksmchsiz)/1024/1024) mb
      ,count(ksmchsiz) cnt
      ,round(avg(ksmchsiz)) avg
      ,min(ksmchsiz) min
      ,max(ksmchsiz) max
  from x$ksmsp v
group by substr(ksmchcom, 1, decode((instr(ksmchcom, '^') - 1), -1,
                             length(ksmchcom), (instr(ksmchcom, '^') - 1)))
order by mb desc
)
select v.name                            v_name
      ,v.bytes                           v_bytes
      ,round(v.bytes/1024/1024)          v_mb
      ,nvl(v.bytes, 0) - nvl(x.bytes, 0) delta
      ,x.*
from (select name, bytes from v$sgastat where pool='shared pool') v
     full outer join                                              x
  on lower(v.name) = lower(x.name)
order by abs(delta) desc nulls last;