Wednesday, November 7, 2018

Latch: row cache objects Contentions and Scalability (V)


(I)-Tracing Methods      (II)-Object Type(Part-1)      (III)-Object Type(Part-2)       (IV)-Sql Executions      (V)-Contentions and Scalability


This Blog will test the scalability of row cache objects discussed in previous 4 Blogs:
  1. Oracle row cache objects Event: 10222, Dtrace Script (I)
  2. Row Cache Objects, Row Cache Latch on Object Type: Plsql vs Java Call (Part-1) (II)
  3. Row Cache Objects, Row Cache Latch on Object Type: Plsql vs Java Call (Part-2) (III)
  4. Row Cache and Sql Executions (IV)
and then attempt to build a mathematical model to assess such a system.

Empirical data and model prediction demonstrate that such a system can hardly scale with the number of CPUs because of Latch contentions.

In a pure row cache object system, theoretically 3 Processors (CPUs), and experimentally about 9 Processors, are enough to (almost) achieve the maximum throughput. Extra Processors only add to the contention.

Possible alternatives could be using Oracle basic (built-in) data types or object string serialization.

Finally, we try to show that "row cache objects Latch" suffers the same fate as the single "Result Cache: RC Latch" discussed in Blog: PL/SQL Function Result Cache Invalidation (I).

As a by-product of this exercise, "library cache: mutex X" contentions related to Oracle Object Types are also observed.

Note: All tests are done in Oracle 12.1.0.2.0 on Solaris, Linux, AIX (SMT 4, LCPU=24) with 6 physical processors.


1. Test


We will make the test based on the code of Blog: Row Cache Objects, Row Cache Latch on Object Type: Plsql vs Java Call (Part-1) (II)


1.1. Plsql Test


Run (see appended Test Code: Plsql):
exec foo_proc_sessions_ctrl(1e9, 10);
which performs the dynamic Plsql Call (foo_proc) with 9 different parallel degrees, each of which runs for 10 minutes.

Then collect the test data by query:
select * from exec_stats_v;
The Table below shows the number of parallel (concurrent) sessions, the total number of executions, cpu_time in seconds, concurrency_wait_time in seconds, and microseconds per execution.
exec_stats_v (Plsql SQL_ID: 241pfd82cu4wj)

SESSIONS   EXECUTIONS CPU_TIME_S CONCURRENCY_WAIT_TIME_S US_PER_EXEC
--------   ---------- ---------- ----------------------- -----------
Solaris        
       1    8,096,318        413                       0          51
       3   21,850,342      1,212                      45          57
       6   32,609,592      1,923                     520          75
      12   42,217,169      2,555                   1,693         125
      18   42,890,535      2,615                   3,548         193
      24   42,955,245      2,631                   5,667         265
      36   43,117,305      2,633                   9,372         406
      42   42,523,671      2,616                  11,551         480
      48   40,546,142      2,506                  13,915         561 

AIX            
       1    6,535,357        251                       0          65
       3   17,355,458        714                      75          74
       6   21,327,093      1,069                     664         117
      12   24,797,761      1,353                   1,814         198
      18   25,037,784      1,740                   2,992         297
      24   20,691,357      2,597                   2,537         542
      36   19,048,665      2,945                   7,872         963
      42   19,199,984      2,872                  11,683       1,132
      48   19,214,011      2,815                  15,315       1,306
               
Linux          
       1   13,992,932        413                       0          30
       3   32,715,487      1,216                      44          39
       6   52,665,062      2,222                     316          48
      12   49,743,420      2,146                   2,321          98
      18   50,448,264      2,249                   4,525         149
      24   50,836,921      2,340                   6,895         201
      36   51,864,133      2,458                  12,002         307
      42   48,914,411      2,495                  14,942         390
      48   48,549,375      2,535                  17,989         460         
The above result shows:
  Throughput (EXECUTIONS) climbs approximately linearly from 1 to 3 SESSIONS; beyond 6 SESSIONS it tends to flatten or even descend.
  Max throughput is achieved with around 9 parallel Sessions.
  The performance is saturated with more than 12 parallel Sessions (see the quick check sketched below).
  The response time per execution (US_PER_EXEC) increases, probably due to the Latch contentions
    (sessions spend more time on Latch Gets instead of real work).
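
As a quick check of the saturation claim, the throughput can be compared against perfect linear scaling from the single-session run. A minimal sketch, assuming exec_stats_v from the appended Test Code and one complete recorded Plsql run:

-- Scalability ratio: EXECUTIONS(n) / (n * EXECUTIONS(1)).
-- A value close to 1 means linear scaling; for the data above it drops well below 1 beyond 6 SESSIONS.
select sessions, executions,
       round(executions /
             (sessions * first_value(executions) over (order by sessions)), 2) scalability_ratio
from exec_stats_v
where sessions > 0
order by sessions;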
During the test, we also collect the data for Row Cache Object Gets, and Row Cache Latch Gets.

Here is the output for 12 and 42 parallel sessions.
select * from rco_stats_v where sessions in (12, 42) order by child#;

rco_stats_v

SESSIONS  CHILD#  RC_PARAMETER      GETS       MISSES    SLEEPS  SPIN_GETS  WAIT_TIME_S  RC_GETS    RC_GETMISSES  RC_THEORY_GETS  CACHE#_LIST  PARAMETER_LIST
--------  ------  ----------------  ---------  --------  ------  ---------  -----------  ---------  ------------  --------------  -----------  ----------------------------
   12       8     dc_users          633984132  19649121  52958   19596500   262          211328141  0             1901952396      7, 10        dc_users, dc_users
   12       9     dc_object_grants  633984845  54784381  58712   54726007   292          211328352  0             1901954535      8, 8         dc_objects, dc_object_grants
   12       10    dc_global_oids    633982830  52451846  68813   52383487   346          211327687  0             1901948490      17           dc_global_oids
            
   42       8     dc_users          638662605  22653084  56234   22597401   2274         212887753  47            1915987815      7, 10        dc_users, dc_users
   42       9     dc_object_grants  638674026  56604969  65236   56540334   3006         212891172  552           1916022078      8, 8         dc_objects, dc_object_grants
   42       10    dc_global_oids    638657482  64387264  76460   64311564   3905         212885997  30            1915972446      17           dc_global_oids
Looking at the case of 42 SESSIONS: although the GETS of all 3 Child Latches are similar, dc_global_oids suffers most, then dc_object_grants, and dc_users least, as exposed by the different MISSES, SLEEPS, SPIN_GETS and WAIT_TIME. We will use this information in the later modeling.

Note that Latch GETS is 3 times RC_GETS.
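
The 3:1 ratio can be verified directly from the recorded statistics; a minimal sketch, assuming rco_stats_v from the appended Test Code:

-- Latch Gets per Row Cache Get: each ratio should come out close to 3,
-- consistent with one Row Cache Get acquiring its Child Latch at the 3 Locations listed later in Section 5.5.
select sessions, child#, rc_parameter,
       round(gets / nullif(rc_gets, 0), 1) latch_gets_per_rc_get
from rco_stats_v
where sessions in (12, 42)
order by sessions, child#;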


1.2. Java Test


Run appended shell script:
ksh ./foo_proc_sessions_java 600
which invokes Java (RCOObjTypeJDBC1) to call foo_proc with 9 different parallel degrees, each of which runs for 10 minutes, the same as the above Plsql Test.

Then collect similar data for the Java Call, shown in the Table below.

exec_stats_v (Java SQL_ID: 8nkt65d7hcnxz)

SESSIONS EXECUTIONS CPU_TIME_S  CONCURRENCY_WAIT_TIME_S US_PER_EXEC
-------- ---------- ----------  ----------------------- -----------
Solaris                                                            
      1   3,790,741         294                       0          77
      3   9,154,792         774                     112          99
      6  13,334,385       1,213                     631         149
     12  16,216,973       1,565                   1,711         264
     18  16,971,173       1,720                   2,622         400
     24  17,062,335       1,745                   3,962         551
     36  16,616,651       1,735                   6,995         925
     42  16,492,514       1,733                  10,003       1,117
     48  16,264,599       1,721                  13,106       1,323

AIX                                                                
      1   3,288,798         166                       0          86
      3   8,965,468         498                      39          99
      6  11,661,246         701                     221         148
     12  14,770,591       1,186                     972         274
     18  14,791,463       1,715                   1,534         459
     24  13,030,697       2,220                   1,727         749
     36  12,096,987       2,211                   5,475       1,173
     42  11,845,242       2,090                   8,166       1,388
     48  11,732,373       2,080                  11,252       1,661
                                                                   
Linux                                                              
      1   4,788,119         197                       0          50
      3  12,325,022         647                      22          66
      6  20,053,654       1,233                     261          91
     12  20,646,925       1,050                   1,518         156
     18  20,118,792       1,016                   3,261         257
     24  19,740,172       1,148                   5,525         393
     36  19,285,554       1,229                  10,114         662
     42  19,069,527       1,226                  12,541         806
     48  18,363,626       1,194                  15,031         978                                                                  
The result shows that Java throughput (EXECUTIONS) is about 2 times lower than Plsql, and its response time (US_PER_EXEC) is about 2 times higher than Plsql.

In Blog: Row Cache Objects, Row Cache Latch on Object Type: Plsql vs Java Call (Part-1) (II), we saw that each dynamic Plsql foo execution requires 45 Row Cache Latch Gets, whereas a Java execution requires 63. That is about 40% more Latch Gets in Java, yet its throughput is about 2 times lower than Plsql.
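
The per-execution Latch Gets can also be re-derived from the recorded statistics. A minimal sketch, assuming rco_stats_v and exec_stats_v from the appended Test Code and one recorded run; it should come out around 45 per execution for Plsql and around 63 for Java:

-- Row Cache Latch Gets per foo execution, per recorded run and session count.
select e.run, e.sessions,
       round(sum(r.gets) / e.executions) latch_gets_per_exec
from rco_stats_v r, exec_stats_v e
where r.run = e.run and r.sessions = e.sessions
  and e.executions > 0
group by e.run, e.sessions, e.executions
order by e.run, e.sessions;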


2. Modeling


Let's try to model the 3 Row Cache Object (CID: 17, 11, 7) Gets and the respective Latch Gets, each of which goes through 3 latch Locations ("Where").

In the following discussion, we model:
  Oracle Session          as Job
  Row Cache Object Get    as Task
  Task Processing Server  as Machine
Suppose we have one Workshop W, and n Job Generators:
  G = (G1, G2, ... ,Gn)
each of which holds one single Job at any instant (as soon as one terminates, a new one is produced).

Each Job is made of 3 Tasks (sub-Jobs):
  J_i = (Si_1, Si_2, Si_3)
Each Task consists of 3 Work Units:
  Si_1 = (Si_1_u1, Si_1_u2, Si_1_u3)
  Si_2 = (Si_2_u1, Si_2_u2, Si_2_u3)
  Si_3 = (Si_3_u1, Si_3_u2, Si_3_u3)
They are subject to constraints:
  All 3 Tasks in each Job have to be processed sequentially.
  All 3 Work Units in each Task have to be processed sequentially.
  The Tasks and Work Units among different Jobs can be in parallel.
The Workshop is equipped with an assembly line, which consists of 3 Machines (Processors):
  W = (M1, M2, M3)
Each Machine possesses 3 Processing Units:
  M1 = (p1_1, p1_2, p1_3)
  M2 = (p2_1, p2_2, p2_3)
  M3 = (p3_1, p3_2, p3_3)
3 Machines are dedicated respectively for 3 Tasks:
  M1 exclusively processes Si_1
  M2 exclusively processes Si_2
  M3 exclusively processes Si_3
M1, M2, M3 run in parallel (inter-parallel); but each Machine's 3 Processing Units can hardly run in parallel (no intra-parallel).

The service times of the 3 Machines are:
  t1 for M1 to process Si_1
  t2 for M2 to process Si_2
  t3 for M3 to process Si_3 
So the minimum processing time of each Job is (t1 + t2 + t3).

Let's look at the processing of the first n Jobs.

Assume t1 < t2 < t3 (later empirical data support this presumption). After the i-th Job has been processed by M1, there are:
  (n-i)*(t2-t1)/t1 
Jobs waiting between M1 and M2; and there are:
  (n-i)*(t3-t2)/t2  
Jobs waiting between M2 and M3 when the i-th Job is being processed by M3.

So M3 processing the i-th Job causes a delay of
  (t3-t2)
for
  (n-i)*(t3-t2)/t2 
Jobs; that means the accumulated delay caused by the i-th Job (J_i) is:
  (t3-t2) * (n-i) * (t3-t2)/t2    1 <= i <= n 
The total accumulated waiting time before M3 for processing all n Jobs is:
   (t3-t2) * (n*(n-1)/2) * (t3-t2)/t2
  = n*(n-1) * (t3-t2)^2 / (2 * t2)       
In the case of t1 < t2 < t3, all Jobs end up waiting before M3. If we only consider the fastest machine M1 and the slowest machine M3, i.e. min(t1, t2, t3) = t1 and max(t1, t2, t3) = t3, where M1 is the first machine, M3 is the last one, and both run in parallel, the accumulated waiting time inside the whole Workshop can be approximately estimated as:
  n*(n-1) * (t3-t1)^2 / (2 * t1)
The average waiting time per Job is then:
  (n-1) * (t3-t1)^2 / (2 * t1)
Putting it all together, the average response time can be expressed as:
    avg_response_time (rt) = waiting_time (wt) + service_time (st)
                           = (n-1) * (t3-t1)^2 / (2 * t1) + t3
    
    throughput             = (1 / rt) 
                          <= (1 / t3) 
Therefore,
  the total response time grows quadratically with n;
  the average response time grows linearly with n, which matches the shape of the test result;
  the maximum throughput is 1 / avg_response_time, hence bounded by 1 / t3.
Take as an example the 10222 trace file of Solaris shown in Section 3 (Model Justification):
  t1 = 255 us   for cid = 17
  t2 = 324 us   for cid = 11 
  t3 = 410 us   for cid =  7 (covered by cid = 11)
    
  t1 + t2 + t3 = 989 (255+324+410) = sum
The times collected in the 10222 trace are much bigger than the Solaris US_PER_EXEC for one single session (SESSIONS=1), probably because of the overhead of 10222 tracing. So we first scale them to the numbers without tracing (all are given after "->").
  US_PER_EXEC = 51  -- Solaris SESSIONS=1 
  
  t1 + t2 + t3 = 989 (255+324+410) = sum -> 51
  
  t1 = 255   -> 51 * 255/989 = 13
  t2 = 324   -> 51 * 324/989 = 17    
  t3 = 410   -> 51 * 410/989 = 21
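
The same scaling can be reproduced with a one-line query (a sketch; 51 is the measured US_PER_EXEC, and 255/324/410/989 are the traced elapsed times from above):

select round(51*255/989) t1_us,   -- cid = 17 -> 13
       round(51*324/989) t2_us,   -- cid = 11 -> 17
       round(51*410/989) t3_us    -- cid =  7 -> 21
from dual;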
Now we can try the calculation for 12 SESSIONS (n=12). The accumulated waiting time inside the entire Workshop is:
  n*(n-1) * (t3-t1)^2 / (2 * t1)        
  = 12*11*(21-13)*(21-13)/(2*13) = 324
The average waiting time for each Job is:
  (n-1) * (t3-t1)^2 / (2 * t1) 
  = 11*(21-13)*(21-13)/(2*13) = 27
Average response time for each Job:
  27 + 21 = 48  
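The same two steps in a short query (a sketch with the scaled Solaris values t1=13, t3=21, n=12):

select round((12-1)*power(21-13, 2)/(2*13))      avg_wait_us,      -- ~27
       round((12-1)*power(21-13, 2)/(2*13)) + 21 avg_response_us   -- ~48
from dual;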
In the above discussion, a Job is modelled in 2 Layers:
  Task
  Work Unit
The Workshop is also modelled in 2 Layers correspondingly:
  Machine 
  Processing Units
So far we have only assessed the first Layer. The second Layer has not yet been taken into account:
  Work Units 
  Processing Units 
Another deficiency is that there are n Job Generators, and each Generator produces its next Job as soon as the previous one terminates (boundary condition), so there are always n Jobs in the system; the above model only considers a single pass over the first n Jobs.

Back to Row Cache, the scope of the model is still too limited to reflect the real system because we only consider inter-parallelism, but not intra-parallelism. This is not precise because intra-parallelism is related to the 2nd and 3rd latch Get Locations, whereas inter-parallelism is related to the 1st latch Get Location.

Besides that, the real system is influenced by Latch timeouts, spinning, process preemption, multithreading, and hardware specifics (e.g. Solaris LWP, AIX SMT).

As implied by the model, there is no deadlock in such a system since all operations are performed sequentially.


3. Model Justification


To verify the model, 3 parameters t1, t2, t3 have to be acquired.

Run foo_proc (see Blog: Row Cache Objects, Row Cache Latch on Object Type: Plsql vs Java Call (Part-1) (II)) with 10222 trace in an idle system.

Then excerpt all 3 Cache Gets (CID: 17, 11, 7) for one Object instance Get, for example T_OBJ_OUT (this is only illustrative; in practice, more representative data should be captured).

Here is an example from Solaris (irrelevant lines removed):
(The 12cR1 10222 trace is hard to read; 12cR2 improved it. See Blog: Oracle row cache objects Event: 10222, Dtrace Script (I))

--================ Start T_OBJ_OUT Get ================--

kqrfrpo : freed to fixed free list po=175d2f708 time=1459184662

  ##********* cid = 17, Start time=1459184662 *********--
kqrpad: new po 17727bac0 from kqrpre1.1

kqrReadFromDB : kqrpre1.1 po=17727bac0 flg=8000 cid=17 eq=1753067e0 idx=0 dsflg=0
kqrpre1 : done po=17727bac0 cid=17 flg=2 hash=3c4bb9d0 0 eq=1753067e0 SQL=begin :1 := foo(:2, :3, :4); end; time=1459184804

  ##------- cid = 17, End time=1459184917, Elapsed = 255 (1459184917-1459184662) ------- 
kqrfrpo : freed to fixed free list po=17727bac0 time=1459184917

  ##--********* cid = 11, Start time=1459184917, including cid = 7 *********--

kqrpad: new po 175d2f708 from kqrpre1.1

kqrReadFromDB : kqrpre1.3 po=175d2f708 flg=8000 cid=11 eq=1753067e0 idx=0 dsflg=0
kqrpre1 : done po=175d2f708 cid=11 flg=2 hash=94b841cf a9461655 eq=1753067e0 
          obobn=2360170 obname=T_OBJ_OUT obtyp=13 obsta=1 obflg=0 SQL=begin :1 := foo(:2, :3, :4); end; time=1459185132

  ##--********* cid = 7, Start time=1459185132 *********--

kqrpad: new po 170d94488 from kqrpre1.1

kqrReadFromDB : kqrpre1.3 po=170d94488 flg=8000 cid=7 eq=176e410b8 idx=0 dsflg=0
kqrpre1 : done po=170d94488 cid=7 flg=2 hash=de7751cd 395edb55 eq=176e410b8 
          SQL=begin :1 := foo(:2, :3, :4); end; time=1459185415

  ##------- cid = 7, End time=1459185542, Elapsed = 410 (1459185542-1459185132) -------
kqrfrpo : freed to heap po=170d94488 time=1459185542

kqrmupin : kqrpspr2 Unpin po 175d2f708 cid=11 flg=2 hash=a9461655 time=1459185556

  ##------- cid = 11, End time=1459185651, Elapsed = 734 (1459185651-1459184917), Pure = 324 (734-410) -------
kqrfrpo : freed to fixed free list po=175d2f708 time=1459185651

--================ End T_OBJ_OUT Get ================--
In the example output above from Solaris (irrelevant lines removed; AIX and Linux are added for comparison), the elapsed values for the 3 Cache Gets are captured ("time" in microseconds, marked in the ## comment lines), then converted into values without 10222 tracing (all prefixed by "->") according to the US_PER_EXEC for SESSIONS=1 (no concurrency) in Table-1.
  
----------------------- Solaris ----------------------
  US_PER_EXEC = 51  -- Table-1 Solaris SESSIONS=1 
  
  t1 = 255 us -> 13  for cid = 17
  t2 = 324 us -> 17  for cid = 11 
  t3 = 410 us -> 21  for cid =  7 (covered by cid = 11)
  
  total Elapsed = 989 us (255+324+410) -> 51
  
----------------------- AIX ----------------------
  US_PER_EXEC = 65  -- Table-1 AIX SESSIONS=1 
  
  t1 = 143 us ->  9  for cid = 17
  t2 = 186 us -> 12  for cid = 11 
  t3 = 252 us -> 16  for cid =  7 (covered by cid = 11)
  
  total Elapsed = 581 us (143+186+252) -> 65  

----------------------- Linux ----------------------
  US_PER_EXEC = 30  -- Table-1 Linux SESSIONS=1 

  t1 = 136 us ->  8  for cid = 17
  t2 = 164 us ->  9  for cid = 11 
  t3 = 229 us -> 13  for cid =  7 (covered by cid = 11)
  
  total Elapsed = 529 us (136+164+229) -> 30  
Substituting these variables into the above average response time formula:
  
  (n-1) * (t3-t1)^2 / (2 * t1) + t3
and running the following queries (for SESSIONS varying from 1 to 48), we get the model's average response time in microseconds (us) per execution: MODEL_US_PER_EXEC. Both test data and model data are shown in the Table below.
  
-------- Solaris --------
select * from (
  select level sessions,
         round((level-1)*(21-13)*(21-13)/(2*13)) + 21 model_us_per_exec
  from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

-------- AIX --------
select * from (
  select level sessions,
         round((level-1)*(16-9)*(16-9)/(2*9)) + 16 model_us_per_exec
  from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

-------- Linux --------
select * from (
  select level sessions,
         round((level-1)*(13-8)*(13-8)/(2*8)) + 13 model_us_per_exec
  from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);
All tests are done in Oracle 12.1.0.2.0 on Solaris, Linux, AIX (SMT 4, LCPU=24) with 6 physical processors. Linux and AIX are added for comparison.

Parallel | Test       | Test        | Model      
SESSIONS | EXECUTIONS | US_PER_EXEC | US_PER_EXEC
---------| ---------- | ----------- | -----------
Solaris  |            |             |            
       1 |  8,096,318 |          51 |          21
       3 | 21,850,342 |          57 |          26
       6 | 32,609,592 |          75 |          33
      12 | 42,217,169 |         125 |          48
      18 | 42,890,535 |         193 |          63
      24 | 42,955,245 |         265 |          78
      36 | 43,117,305 |         406 |         107
      42 | 42,523,671 |         480 |         122
      48 | 40,546,142 |         561 |         137
---------| ---------- | ----------- | -----------
AIX      |            |             |            
       1 |  6,535,357 |          65 |          16
       3 | 17,355,458 |          74 |          21
       6 | 21,327,093 |         117 |          30
      12 | 24,797,761 |         198 |          46
      18 | 25,037,784 |         297 |          62
      24 | 20,691,357 |         542 |          79
      36 | 19,048,665 |         963 |         111
      42 | 19,199,984 |       1,132 |         128
      48 | 19,214,011 |       1,306 |         144
---------| ---------- | ----------- | -----------
Linux    |            |             |            
       1 | 13,992,932 |          30 |          13
       3 | 32,715,487 |          39 |          16
       6 | 52,665,062 |          48 |          21
      12 | 49,743,420 |          98 |          30
      18 | 50,448,264 |         149 |          40
      24 | 50,836,921 |         201 |          49
      36 | 51,864,133 |         307 |          68
      42 | 48,914,411 |         390 |          77
      48 | 48,549,375 |         460 |          86
It is now up to the reader to judge the model by comparing the empirical observations with the model-predicted values, and to inspect its capability of extrapolation.


4. Row Cache Objects Latch: dedicated


The following query shows that there exists a 1-to-1 mapping between Row Cache and Row Cache Latch children. Each Cache is protected by one single dedicated Latch, hence access to a Row Cache is serialized by its designated Latch.

create view sys.x_kqrst as select * from x$kqrst;

select count(distinct latch_child), count(distinct cache#)
  from (
  select la.child# latch_child, kqrstcid cache#, kqrsttxt parameter, dc.*, la.* 
  from sys.x_kqrst dc, v$latch_children la 
  where dc.kqrstcln = la.child# and la.name='row cache objects'
  )
where kqrsttyp =1;

COUNT(DISTINCTLATCH_CHILD)  COUNT(DISTINCTCACHE#)
--------------------------  ---------------------
56                          56


select *
  from (
  select la.child# latch_child, kqrstcid cache#, kqrsttxt parameter, kqrsttyp  --, dc.*, la.* 
  from sys.x_kqrst dc, v$latch_children la 
  where dc.kqrstcln = la.child# and la.name='row cache objects'
  )
where kqrsttyp =1   -- parent
order by latch_child;

LATCH_CHILD  CACHE#  PARAMETER              KQRSTTYP
----------- ------- ---------------------  ---------
8            10      dc_users               1
9            8       dc_objects             1
10           17      dc_global_oids         1
18           15      dc_props               1
For example, to access Object Type: T_OBJ_OUT, we sequentially go through 3 Caches:
  dc_global_oids
  dc_objects
  dc_users
and each of which is guarded by its responsible Child Latch (the sketch after this list also shows the subordinate caches sharing these Latches):
  10
  9
  8  
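A minimal sketch, reusing the sys.x_kqrst view created above, to list both the parent and (presumably, kqrsttyp = 2) subordinate caches hanging off these 3 Child Latches; this also explains the CACHE#_LIST and PARAMETER_LIST columns seen in rco_stats_v:

select la.child# latch_child, kqrstcid cache#, kqrsttxt parameter, kqrsttyp
from sys.x_kqrst dc, v$latch_children la
where dc.kqrstcln = la.child# and la.name = 'row cache objects'
  and la.child# in (8, 9, 10)
order by la.child#, kqrsttyp, kqrstcid;
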
It looks like "row cache objects Latch" behaves similarly to the single "Result Cache: RC Latch" discussed in Blog: PL/SQL Function Result Cache Invalidation (I). But the difference is that the latter is optional and can be turned off, whereas the former is an unavoidable obstacle.

At this point, we can take a further look at other Oracle Latches. Probably only the cache buffers chains Latch is conceived to be scalable (notwithstanding the "latch: cache buffers chains" Wait Event), since each of its children has a limited responsibility and together they make up the majority of all Latch children (> 75%). Any excessive usage of other Latches is rated as irrational, and hence ends in contention.

select l.name, count(*) cnt, round(100*ratio_to_report(count(*)) over (), 2) ratio 
from v$latch l, v$latch_children k
where l.latch# = k.latch#(+) 
--  and l.name in ('cache buffers chains', 'row cache objects', 'shared pool', 'Result Cache: RC Latch', 'kokc descriptor allocation latch')
--  and lower(l.name) like '%kokc%'
group by l.name order by cnt desc, l.name;

NAME                               CNT     RATIO
--------------------------------- ------  ------
cache buffers chains               32768   75.38
simulator hash latch               2048    4.71
...                                        
row cache objects                  56      0.13
...                                        
shared pool                        7       0.02
...                                        
Result Cache: RC Latch             1       0
... 
kokc descriptor allocation latch   1       0

770 rows selected.
(-------------- cache buffers chains --------------
Hidden Parameters and default values: 
  _db_block_hash_latches=32768    Number of database block hash latches,
  _db_block_hash_buckets=1048576  Number of database block hash buckets 
           1048576/32768=32       buckets per latch
 
Blog: Hot Block Identification and latch: cache buffers chains
-------------- simulator hash latch --------------
Blog: Row Cache and Sql Executions (IV)
-------------- row cache objects --------------
"Current Location" 
and
Blog: nls_database_parameters, dc_props, latch: row cache objects
-------------- shared pool --------------
Blog: sql_id and idn to shared pool subpool
-------------- Result Cache: RC Latch --------------
Blog: Result Cache:0 rows updated Invalidations (IV)
-------------- kokc descriptor allocation latch --------------
Bug 14382262  latch free wait event in"kokc descriptor allocation latch"

Blog: dbms_aq.dequeue - latch: row cache objects on AIX
Blog: Shared Pool - KKSSP
)


5. AWR/ASH - Row Cache Objects


We collect AWR/ASH reports for one 10-minute Run of the Plsql Test with 42 parallel sessions, and extract all Row Cache Objects related Sections.


5.1. Load Profile


DB CPU(s) Per Second: 5.9 shows that all 6 CPUs are busy (CPU intensive), and about 100*(1-5.9/66.2) = 91% of DB Time is spent waiting.

Load Profile
              Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):         66.2            373.6         0    195.01
DB CPU(s):           5.9             33.2         0     17.34


5.2. Top 10 Foreground Events by Total Wait Time


23% of DB time is spent waiting on Event latch: row cache objects, with Wait Avg(ms) 46.85 (probably bigger than the 10 ms CPU scheduling time slice).

9.6% of DB time is spent waiting on Event library cache: mutex X, with Wait Avg(ms) 10.49.

Top 10 Foreground Events by Total Wait Time
Event                     Waits    Total Wait Time (sec)  Wait Avg(ms)  % DB time
latch: row cache objects  196,081                 9185.8         46.85         23
library cache: mutex X    367,581                   3856         10.49        9.6
DB CPU                                             3554.8                      8.9


5.3. SQL ordered by Elapsed Time


SQL_ID: 241pfd82cu4wj has 42,533,404 Executions with Elapsed Time 20,423.63 seconds and no IO (%IO = 0).

SQL ordered by Elapsed Time
Elapsed Time (s)  Executions  Elapsed Time per Exec (s)  %Total   %CPU  %IO  SQL Id         SQL Text
       24,895.60          42                     592.75   62.27  14.26    0  22thkvvjndnmj  DECLARE job BINARY_INTEGER := ...
       20,423.63  42,533,404                          0   51.09  12.81    0  241pfd82cu4wj  begin :1 := foo(:2, :3, :4); e...


5.4. Latch Activity and Latch Sleep Breakdown


Total Latch Gets is 1,916,333,412, with 197,952 Sleeps and about 7.5% Latch Miss (Misses/Gets = 143,650,808/1,916,333,412). The 197,952 Sleeps roughly match the 196,081 Latch Waits in Section "Top 10 Foreground Events by Total Wait Time".

Looking at the rco_stats_v table for 42 SESSIONS, the GETS of the 3 Caches' Latches (dc_global_oids, dc_object_grants, dc_users) sum to 1,915,994,113, close to the above Total Latch Gets.

The same match holds for MISSES, SLEEPS, SPIN_GETS and WAIT_TIME_S.
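A minimal sketch of that cross-check, assuming rco_stats_v from the appended Test Code:

-- Sum the 3 Row Cache Objects Child Latches of the 42-session run and
-- compare with the AWR Latch Activity / Latch Sleep Breakdown totals.
select sum(gets) gets, sum(misses) misses, sum(sleeps) sleeps,
       sum(spin_gets) spin_gets, sum(wait_time_s) wait_time_s
from rco_stats_v
where sessions = 42 and child# in (8, 9, 10);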

Latch Activity
Latch Name         Get Requests   Pct Get Miss  Avg Slps /Miss  Wait Time (s)
row cache objects  1,916,333,412           7.5               0           9186

Latch Sleep Breakdown
Latch Name         Get Requests   Misses       Sleeps   Spin Gets
row cache objects  1,916,333,412  143,650,808  197,952  143,454,769


5.5. Latch Miss Sources


Blog: Oracle row cache objects Event: 10222, Dtrace Script (I) shows that each Latch Get goes through 3 consecutive Locations ("Where"):
  Where=>4441(0x1159): kqrpre: find obj   -- latch Get at 1st Location
  Where=>4464(0x1170): kqreqd             -- latch Get at 2nd Location
  Where=>4465(0x1171): kqreqd: reget      -- latch Get at 3rd Location
Here we can see that "kqrpre: find obj" is the main blocker (Waiter Sleeps: 179,972), whereas "kqreqd: reget" is the top victim (Sleeps: 157,629), accounting for about 80% (157,629/197,952) of the Latch Sleeps.

In the above discussion (Section 2), we noted that inter-parallelism is related to the 1st Latch Get Location.

Latch Miss Sources
Latch Name         Where             NoWait Misses   Sleeps  Waiter Sleeps
row cache objects  kqreqd: reget                 0  157,629          3,121
row cache objects  kqrpre: find obj              0   33,484        179,972
row cache objects  kqreqd                        0    6,812         14,827
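
The same breakdown can also be pulled live from v$latch_misses instead of the AWR report; a minimal sketch (values are cumulative since instance startup):

select location, nwfail_count, sleep_count, wtr_slp_count
from v$latch_misses
where parent_name = 'row cache objects'
  and sleep_count > 0
order by sleep_count desc;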


5.6. Mutex Sleep Summary


This is related to library cache: mutex X (to be discussed in the next Section).

Mutex Sleep Summary
Mutex Type     Location       Sleeps   Wait Time (ms)
Library Cache  kglpndl1  95   163,409       1,187,289
Library Cache  kglhdgn2 106   161,608       1,191,330
Library Cache  kglpin1    4    58,848         393,368
Library Cache  kgllkdl1  85    31,551         186,930
Library Cache  kglpnal1  90    30,451         176,736
Library Cache  kglget1    1    30,157         233,526
Library Cache  kglhdgn1  62    26,839         229,241


5.7. Child Latch Statistics


See rco_stats_v.

Child Latch Statistics
Latch Name         Child Num  Get Requests  Misses      Sleeps  Spin & Sleeps 1->3+
row cache objects          9   638,804,551  56,606,984  65,237  56542348/0/0/0
row cache objects          8   638,793,715  22,654,054  56,248  22598358/0/0/0
row cache objects         10   638,788,857  64,389,752  76,467  64314045/0/0/0


5.8. Dictionary Cache Stats


See rco_stats_v.

I almost overlooked this Section because each popular Oracle term gets two names (here: Row Cache vs. Dictionary Cache).

Dictionary Cache Stats
Cache           Get Requests  Pct Miss  Scan Reqs  Pct Miss  Mod Reqs  Final Usage
dc_global_oids   212,933,433         0          0                   0           69
dc_objects       212,938,294         0          0                 117        3,585
dc_users         212,935,101         0          0                   0           89
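
These figures come from v$rowcache (Get Requests = GETS, Mod Reqs = MODIFICATIONS). A minimal sketch; the values are cumulative since startup, so take before/after deltas as rco_stats_v does:

select cache#, parameter, gets, getmisses, modifications
from v$rowcache
where parameter in ('dc_global_oids', 'dc_objects', 'dc_users')
order by cache#;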


5.9. Top SQL with Top Events, Top Events


This ASH Section lists our foo statement as the trigger of the Top Events.

Top SQL with Top Events
SQL ID         Plan Hash  Executions  % Activity  Event                     % Event  SQL Text
241pfd82cu4wj                   2048       84.37  latch: row cache objects    41.39  begin :1 := foo(:2, :3, :4); e...
241pfd82cu4wj                   2048       84.37  CPU + Wait for CPU          34.99  begin :1 := foo(:2, :3, :4); e...
241pfd82cu4wj                   2048       84.37  library cache: mutex X       7.92  begin :1 := foo(:2, :3, :4); e...

Top Events
Event                     Event Class  Session Type  % Activity  Avg Active Sessions
CPU + Wait for CPU        CPU          FOREGROUND         45.57                18.39
latch: row cache objects  Concurrency  FOREGROUND         41.39                16.71
library cache: mutex X    Concurrency  FOREGROUND         12.92                 5.22


5.10. Top Event P1/P2/P3 Values



Top Event P1/P2/P3 Values
Event                     % Event  P1, P2, P3 Values                                   % Activity  Parameter 1  Parameter 2  Parameter 3
latch: row cache objects    41.39  "6490258456","411","0"                                   18.91  address      number       tries
latch: row cache objects           "6490257608","411","0"                                   11.94  address      number       tries
latch: row cache objects           "6490258032","411","0"                                   10.54  address      number       tries
library cache: mutex X      12.92  "595386336", "3143916060672", "10136848668098666"         0.66  idn          value        where

Picking all 3 P1 values of the latch Event, run the query below. The output shows that they point to the 3 Child Latches (10, 9, 8) respectively.

select to_number(addr, 'XXXXXXXXXXXXXXXX') addr_num, addr,
       child#, gets, misses, sleeps, spin_gets, wait_time
 from v$latch_children lat
where latch#=411 
  and (addr like '%'||trim(to_char(6490258456, 'XXXXXXXXX'))
    or addr like '%'||trim(to_char(6490257608, 'XXXXXXXXX'))
    or addr like '%'||trim(to_char(6490258032, 'XXXXXXXXX')))
order by sleeps desc;

ADDR_NUM    ADDR              CHILD#  GETS       MISSES    SLEEPS  SPIN_GETS  WAIT_TIME
----------  ----------------  ------  ---------  --------  ------  ---------  ----------
6490258456  0000000182D97C18  10      638829560  64389752  76467   64314045   3905162356
6490258032  0000000182D97A70  9       638989001  56607014  65237   56542378   3006243205
6490257608  0000000182D978C8  8       639105230  22654062  56248   22598366   2274407991


6. library cache: mutex X


Looking at the ASH Section - Top Event P1/P2/P3 Values, pick the P1 value of the mutex Event and run the query below.

The output shows that P1 value "595386336" represents T_OBJ_IN.

with start_snap as (
  select mutex_identifier, location, max(gets) gets, max(sleeps) sleeps, min(sleep_timestamp) min_ts
   from v$mutex_sleep_history
  where sleep_timestamp < sysdate -30/1440 
   group by mutex_identifier, location)
 ,end_snap as (
  select mutex_identifier, location, min(gets) gets, min(sleeps) sleeps, max(sleep_timestamp) max_ts
   from v$mutex_sleep_history
  where sleep_timestamp < sysdate -1/1440 
   group by mutex_identifier, location)
 ,delta as (
  select e.mutex_identifier, e.location, 
       e.gets   - nvl(s.gets, 0)   gets,
       e.sleeps - nvl(s.sleeps, 0) sleeps, min_ts, max_ts
  from start_snap s, end_snap e
  where e.mutex_identifier = s.mutex_identifier (+) 
    and e.location = s.location (+))
 ,snap as (
  select mutex_identifier, sum(gets) gets, sum(sleeps) sleeps, min(min_ts) min_ts, max(max_ts) max_ts
    from delta group by mutex_identifier
   order by sleeps desc
  )  
select o.hash_value, name, s.gets, s.sleeps, s.min_ts, s.max_ts
 from v$db_object_cache o, snap s
where o.hash_value = s.mutex_identifier
--  and mutex_identifier = 595386336
  and rownum <= 5
order by s.sleeps desc;

HASH_VALUE  NAME         GETS        SLEEPS
----------  -----------  ----------  ------
2812823600  T_OBJ_RET    1063400668  19362
668251247   T_OBJ_INOUT  1276396207  13660
595386336   T_OBJ_IN     1063241644   9290
415170529   T_OBJ_OUT    1063009333   8719
4198378475  FOO           254328227   1902
From the above output, if T_OBJ_IN accounts for 0.66 %Activity, T_OBJ_RET can be estimated at about 1.38 %Activity (0.66*19362/9290). So the top consumer is probably not always the one displayed in ASH.

The following query gives an output similar to the AWR Section: Mutex Sleep Summary.
(The data in v$mutex_sleep_history is kept in a circular buffer, so only the most recent sleeps are shown)

with start_snap as (
  select mutex_identifier, location, max(gets) gets, max(sleeps) sleeps
   from v$mutex_sleep_history
  where sleep_timestamp < timestamp'2018-12-17 08:04:30' 
   group by mutex_identifier, location)
 ,end_snap as (
  select mutex_identifier, location, min(gets) gets, min(sleeps) sleeps
   from v$mutex_sleep_history
  where sleep_timestamp < timestamp'2018-12-17 08:14:50'
   group by mutex_identifier, location)
 ,delta as (
  select e.mutex_identifier, e.location, 
       e.gets   - nvl(s.gets, 0)   gets,
       e.sleeps - nvl(s.sleeps, 0) sleeps
  from start_snap s, end_snap e
  where e.mutex_identifier = s.mutex_identifier (+) 
    and e.location = s.location (+))
select location, sum(gets) gets, sum(sleeps) sleeps
  from delta group by location
 order by sleeps desc;
 
LOCATION                    GETS        SLEEPS
--------------------------  ----------  ------
kglhdgn2 106                977757576   13812
kglpndl1  95                1062042442  12725
kglpin1   4                 1020903933  11455
kglpnal1  90                 680931048   9432
kgllkdl1  85                 425439695   5679
kksLockDelete [KKSCHLPIN6]   425095566   3199
kglget1   1                  255195587   2600
kglhdgn1  62                 336370181   2466
kksxsccmp [KKSCHLPIN5]       169434152    242
kksfbc [KKSCHLFSP2]          169595310    232
A quick alternative is to create multiple copies of the hot objects, for example T_OBJ_IN and T_OBJ_RET (see Blog: "library cache: mutex X" and Application Context):

alter system set "_kgl_hot_object_copies"= 255 scope=spfile;

alter system set "_kgl_debug"=
      "name='T_OBJ_IN'   schema='K'    namespace=1  debug=33554432", 
      "name='T_OBJ_RET'  schema='K'    namespace=1  debug=33554432"
      scope=spfile;
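
A minimal sketch to verify the two settings after the instance restart, using the usual x$ksppi/x$ksppsv hidden-parameter query (run as SYS):

select i.ksppinm parameter, v.ksppstvl value
from x$ksppi i, x$ksppsv v
where i.indx = v.indx
  and i.ksppinm in ('_kgl_hot_object_copies', '_kgl_debug');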


7. Test Code: Plsql



drop sequence foo_seq;

create sequence foo_seq;

create or replace force view rco_stats_record_v as
with rc as (
  select min(parameter) rc_parameter, sum(gets) rc_gets, sum(getmisses) rc_getmisses, 
         decode(substr(parameter, 1, 9), 'dc_users', 8, 'dc_object', 9, 'dc_global', 10) la_child#,
         listagg(cache#, ', ') within group (order by cache#) cache#_list,
         listagg(parameter, ', ') within group (order by length(parameter)) parameter_list
  from   v$rowcache where cache# in (7,10, 8,11, 17) and gets > 0 
  group by substr(parameter, 1, 9))
select localtimestamp ts, addr, child#, name, gets, misses, sleeps, spin_gets, wait_time, rc.*, 3*gets RC_theory_gets
from v$latch_children la, rc
where la.latch# = 411 and la.child# in (8, 9, 10) 
  and la.child# = rc.la_child# 
order by la.child#; 

drop table rco_stats;

create table rco_stats as select 0 run, 0 sessions, v.* from rco_stats_record_v v where 1=2;

create or replace force view rco_stats_v  as
select run, sessions, ts, child#, rc_parameter
      ,gets - lag(gets) over (partition by rc_parameter order by run, sessions) gets
      ,misses - lag(misses) over (partition by rc_parameter order by run, sessions) misses
      ,sleeps - lag(sleeps) over (partition by rc_parameter order by run, sessions) sleeps
      ,spin_gets - lag(spin_gets) over (partition by rc_parameter order by run, sessions) spin_gets
      ,round((wait_time - lag(wait_time) over (partition by rc_parameter order by run, sessions))/1e6) wait_time_s
      ,rc_gets - lag(rc_gets) over (partition by rc_parameter order by run, sessions) rc_gets
      ,rc_getmisses - lag(rc_getmisses) over (partition by rc_parameter order by run, sessions) rc_getmisses
      ,rc_theory_gets - lag(rc_theory_gets) over (partition by rc_parameter order by run, sessions) rc_theory_gets
      ,cache#_list, parameter_list
from rco_stats v;


-- Plsql  241pfd82cu4wj    begin :1 := foo(:2, :3, :4); end; 
-- Java   8nkt65d7hcnxz    BEGIN :1 := K.FOO(:2, :3, :4); END;

create or replace force view exec_stats_record_v as
select localtimestamp ts, sql_id, executions, elapsed_time, cpu_time, concurrency_wait_time
from v$sqlarea v 
where (lower(sql_text) like 'begin :1 := %foo(:2, :3, :4); end;%' or sql_id in ('241pfd82cu4wj', '8nkt65d7hcnxz'))
  and last_active_time = (select max(last_active_time) from v$sqlarea where sql_id in ('241pfd82cu4wj', '8nkt65d7hcnxz'))
  and rownum = 1
order by v.last_active_time desc;     

drop table exec_stats;

create table exec_stats as select 0 run, 0 sessions, v.* from exec_stats_record_v v where 1=2;

create or replace force view exec_stats_v as
select run, sessions, ts
      ,executions - lag(executions) over (order by run, sessions) executions
      ,round((elapsed_time - lag(elapsed_time) over (order by run, sessions))/1e6) elapsed_time_s
      ,round((cpu_time - lag(cpu_time) over (order by run, sessions))/1e6) cpu_time_s
      ,round((concurrency_wait_time - lag(concurrency_wait_time) over (order by run, sessions))/1e6) concurrency_wait_time_s
      ,round((elapsed_time - lag(elapsed_time) over (order by run, sessions))/(nullif (executions - lag(executions) over (order by run, sessions), 0))) us_per_exec
      ,sql_id
from exec_stats;

create or replace procedure record_stats (p_sessions number) as 
  l_run   number;
begin
  l_run := foo_seq.nextval;
  insert into rco_stats  select l_run, p_sessions, v.* from rco_stats_record_v v;
  insert into exec_stats select l_run, p_sessions, v.* from exec_stats_record_v v;
  commit;
  sys.dbms_workload_repository.create_snapshot('ALL');  
end;
/

create or replace procedure k.foo_proc_sessions(p_session_cnt number, p_cnt number) as
  l_job_id pls_integer;
begin
  for i in 1.. p_session_cnt loop
    --dbms_job.submit(l_job_id, 'begin foo_proc('||p_cnt||'); end;');
    dbms_job.submit(l_job_id, 'begin while true loop foo_proc('||p_cnt||'); end loop; end;');
  end loop;
  commit;
end;
/

--exec foo_proc_sessions(6, 1e9);

--exec clean_jobs;

create or replace procedure k.foo_proc_sessions_ctrl(p_cnt number, p_dur_min number := 10) as
  type    t_num_tab is table of number;
  l_nums  t_num_tab := new t_num_tab(1, 3, 6, 12, 18, 24, 36, 42, 48);
  l_run   number;
begin
  foo_proc(10);   -- warmup
  clean_jobs; 
  record_stats(0);
  
  for n in 1..l_nums.count loop
    foo_proc_sessions(l_nums(n), p_cnt);    -- n Sessions
    dbms_lock.sleep(p_dur_min*60);          -- m minutes
    clean_jobs; 
    record_stats(l_nums(n)); 
  end loop;
end;
/

-- exec foo_proc_sessions_ctrl(1e9, 10);


8. Test Code: UNIX Script for Java Test



-------------- create Script: foo_proc_sessions_java ------
#!/bin/ksh

# setup CLASSPATH
export CLASSPATH=$CLASSPATH:java-path:jdbc-path:RCOObjTypeJDBC1-path

function db_record_stats {
  sqlplus -s /nolog << EOF
connect k/k
exec k.record_stats($1);
exit;
EOF
}

# warm up
/usr/java/bin/java -Xmx30m -Xms8m RCOObjTypeJDBC1 "jdbc:oracle:thin:k/s@testdb:1521:testdb"  10 2
sleep 2

# MOS (Doc ID 2336567.1). new Oracle Database 12c installation.
# ./foo_proc_sessions_java[20]: db_record_stats[9]: PL/SQL: not found [No such file or directory]

db_record_stats 0

for i in 1 3 6 12 18 24 36 42 48
do
  echo "Sessions Count: $i"
  j=1
  while [[ $j -le $i ]]
  do
    echo "Run: $j"
    /usr/java/bin/java -Xmx30m -Xms8m RCOObjTypeJDBC1 "jdbc:oracle:thin:k/s@testdb:1521:testdb"  1000000000 2 &
    (( j = j + 1 ))
  done
  
  sleep $1   # run interval
  
  for pid in $(ps -ef | awk '/RCOObjTypeJDBC1/ {print $2}')
  do 
      if [ -n "$pid" -a -e /proc/$pid ]; then
          echo "Kill Pid: $pid"   # kill if exist
          kill -9 $pid 
      fi    
  done
  sleep 2     # grace period for killing 
  
  db_record_stats $i
done


-------------- called by ----
chmod a+x ./foo_proc_sessions_java
ksh ./foo_proc_sessions_java 600