Thursday, December 15, 2022

OracleJVM JAVA_JIT_ENABLED Linux /dev/shm mount noexec EPERM and Performance

In Linux, when JAVA_JIT_ENABLED is enabled, the native compiled Java code is stored in /dev/shm/JOEZSHM_*.
However, if tmpfs /dev/shm/ is mounted with "noexec" option as follows:

mount | grep shm
   tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,noexec)
mmap of JOEZSHM_* gets error EPERM ("Operation not permitted"), and JAVA_JIT_ENABLED does not work.

In MZnn and session trace files, there are plenty of "Operation not permitted" (EPERM) for JOEZSHM_*.

As a consequence, Oracle falls back to JAVA_JIT_ENABLED=false and the performance is degraded upto 50 times slower than the functional case.

Note: Tested on Oracle 19.13.


1. Test-1 Simple Case


First we make a simple SQL Java call (dbms_java.getversion calls java.lang.System.getProperty via dbms_java.get_ojvm_property), and strace its Linux process:

SQL > select 'JServer version: '||dbms_java.getversion, 'JDK version:'||dbms_java.get_jdk_version from dual;  

$> strace -tT -o mmap_strace.log -p 16637
After the call, we can see the new created memory files: /dev/shm/JOEZSHM_*:

$> ls -l /dev/shm
  total 32768
  -rwxrwx--- 1 oracle dba 16777216 Dec 13 07:36 JOEZSHM_testdb_1_0_0_0_0_2084322508
  -rwxrwx--- 1 oracle dba 16777216 Dec 13 07:36 JOEZSHM_testdb_1_0_1_0_0_749348400
  
-- no entry found, no JOEZSHM_* mapped to process address space
$> pmap -X -p 16637 |grep -i JOEZSHM
mmap_strace.log shows:

  07:36:23 open("/dev/shm/JOEZSHM_testdb_1_0_0_0_0_2084322508", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0770) = 7 <0.000456>
  07:36:23 lseek(7, 0, SEEK_CUR)          = 0 <0.000018>
  07:36:23 lseek(7, 0, SEEK_END)          = 16777216 <0.000023>
  07:36:23 lseek(7, 0, SEEK_SET)          = 0 <0.000030>
  07:36:23 mmap(NULL, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 7, 0) = -1 EPERM (Operation not permitted) <0.000022>
MZ00 and session trace files are filled with:

sjoezshm_map_obj mmap failed for /JOEZSHM_testdb_1_0_0_0_0_2084322508: Operation not permitted
size  = 16777216, prot_mode  = 7, map_flag = 1
joez_shm_open_object failed: size = 16777216, extnam = /JOEZSHM_testdb_1_0_0_0_0_2084322508  flags = 0x34 
joez: Failed loading machine code: Unable to allocate code space

sjoezshm_map_obj mmap failed for /JOEZSHM_testdb_1_0_1_0_0_749348400: Operation not permitted
size  = 16777216, prot_mode  = 7, map_flag = 1
joez_shm_open_object failed: size = 16777216, extnam = /JOEZSHM_testdb_1_0_1_0_0_749348400  flags = 0x34 
joez: Failed loading machine code: Unable to allocate code space


2. Test-2 Complex Case


Use the same test code in Blog: Oracle 19.4 OracleJVM JAVA_JIT_ENABLED Not Working on AIX to read a 27MB signed JAR file testJar.jar (security signatures under META-INF directory).

$> ls -l /tmp/testJar.jar
  -rw-r--r-- 1 oracle dba 28944084 Dec  6 17:39 /tmp/testJar.jar

$> unzip -l /tmp/testJar.jar
  Archive:  /tmp/testJar.jar
    Length      Date    Time    Name
  ---------  ---------- -----   ----
        144  12-05-2019 08:27   META-INF/MANIFEST.MF
        306  12-05-2019 08:27   META-INF/KUNALIAS.SF
       1471  12-05-2019 08:27   META-INF/KUNALIAS.DSA
          0  12-05-2019 08:26   META-INF/
  110947240  12-05-2019 08:23   test1.txt
  ---------                     -------
  110949161                     5 files
First we invoke OracleJVMJarInputStream without signature verify, it takes 3 seconds:

Sql > set serveroutput on size 50000
Sql > exec dbms_java.set_output(50000); 

SQL > exec OracleJVMJarInputStream(p_verify => 'false', p_info => 'no');

********* getNextJarEntry *********
------ NextJarEntry: 1, Name: META-INF/KUNALIAS.SF ------
         getNextEntry ElapsedMills: 31, at: 1670863088140
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:08 CET 2022
         readContent ElapsedMills: 5, at: 1670863088145
------ NextJarEntry: 2, Name: META-INF/KUNALIAS.DSA ------
         getNextEntry ElapsedMills: 0, at: 1670863088145
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:08 CET 2022
         readContent ElapsedMills: 2, at: 1670863088147
------ NextJarEntry: 3, Name: META-INF/ ------
         getNextEntry ElapsedMills: 0, at: 1670863088148
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:08 CET 2022
         readContent ElapsedMills: 2, at: 1670863088150
------ NextJarEntry: 4, Name: test1.txt ------
         getNextEntry ElapsedMills: 0, at: 1670863088150
         Insert DB 1 row, blob size 110920480 at Mon Dec 12 17:38:11 CET 2022
         readContent ElapsedMills: 3125, at: 1670863091275

PL/SQL procedure successfully completed.
Elapsed: 00:00:03.20
Then we invoke OracleJVMJarInputStream with signature verify, it takes 3 minutes:

Sql > exec OracleJVMJarInputStream(p_verify => 'true', p_info => 'no');

********* getNextJarEntry *********
------ NextJarEntry: 1, Name: META-INF/KUNALIAS.SF ------
         getNextEntry ElapsedMills: 1, at: 1670863115333
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:35 CET 2022
         readContent ElapsedMills: 6, at: 1670863115339
------ NextJarEntry: 2, Name: META-INF/KUNALIAS.DSA ------
         getNextEntry ElapsedMills: 0, at: 1670863115339
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:35 CET 2022
         readContent ElapsedMills: 423, at: 1670863115762
------ NextJarEntry: 3, Name: META-INF/ ------
         getNextEntry ElapsedMills: 0, at: 1670863115762
         Insert DB 1 row, blob size 0 at Mon Dec 12 17:38:35 CET 2022
         readContent ElapsedMills: 3, at: 1670863115765
------ NextJarEntry: 4, Name: test1.txt ------
         getNextEntry ElapsedMills: 0, at: 1670863115765
         Insert DB 1 row, blob size 110920480 at Mon Dec 12 17:41:35 CET 2022
         readContent ElapsedMills: 179651, at: 1670863295417

PL/SQL procedure successfully completed.
Elapsed: 00:03:00.12
If JAVA_JIT_ENABLED works (/dev/shm is mounted without "noexec"), the execution with signature verify is completed in 5 seconds (instead of 3 minutes).


3. Test-3 Standalone mmap.c Test


Web Page: mmap() fails on tmpfs (operation not permitted) #2974 provided a mmap.c test code and showed "mmap() failed: : Operation not permitted".

We can use the same code to simulate Oracle JIT /dev/shm/JOEZSHM_* test.
(with small adaptations of open and mmap arguments to Oracle parameters according to strace output in above Test-1 Simple Case).

// mmap-fail.cpp
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// $ g++ mmap-fail.cpp -o mmap-fail
// $ sudo ./mmap-fail
// mmap() failed: : Operation not permitted

int main(int argc, const char* argv[])
{
    //const char* path = "./eightbytes.bin"; // okay on lxfs
    //const char* path = "/run/eightbytes.bin";
    //int fd = open(path, O_RDWR|O_CREAT|O_CLOEXEC|O_NOCTTY|O_NOFOLLOW, 0644);
    int fd = open("/dev/shm/JOEZSHM_testdb_1_0_0_0_0_2084322508", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0770);
    if (fd < 0) {
        perror("open() failed: ");
        return 1;
    }
    int r = posix_fallocate(fd, 0, sizeof(uint64_t));
    if (r < 0) {
        perror("posix_fallocate() failed: ");
        return 1;
    }
    //uint64_t *p = (uint64_t*)mmap(NULL, sizeof(uint64_t), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    uint64_t *p = (uint64_t*)mmap(NULL, sizeof(uint64_t), PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap() failed: ");
        return 1;
    }
    int r2 = close(fd);

    printf("Whee! mmap SUC\n");
    return 0;
}
Run the test with strace:

$> strace -tT -o mmap-fail_strace.log ./mmap-fail
  mmap() failed: : Operation not permitted
mmap-fail_strace.log shows:

  07:42:37 open("/dev/shm/JOEZSHM_testdb_1_0_0_0_0_2084322508", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0770) = 3 <0.000022>
  07:42:37 fallocate(3, 0, 0, 8)          = 0 <0.000018>
  07:42:37 mmap(NULL, 8, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 3, 0) = -1 EPERM (Operation not permitted) <0.000017>


4. mmap.c EPERM Operation not permitted


From source mm/mmap.c, we can see that EPERM (Operation not permitted) is thrown in do_mmap
(case MAP_SHARED fallthrough to case MAP_PRIVATE).

// mm/mmap.c
unsigned long do_mmap(struct file *file, unsigned long addr,
  ...
	if (file) {
		...
		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
      ...
			fallthrough;
		case MAP_SHARED_VALIDATE:
      ...
			fallthrough;
		case MAP_PRIVATE:
			if (!(file->f_mode & FMODE_READ))
				return -EACCES;
			if (path_noexec(&file->f_path)) {
				if (vm_flags & VM_EXEC)
					return -EPERM;
				vm_flags &= ~VM_MAYEXEC;
			}
			
			
include/uapi/asm-generic/errno-base.h
  #define	EPERM		 1	/* Operation not permitted */


5. Fix


To fix EPERM problem on Linux for JAVA_JIT_ENABLED, remove noexec option for tmpfs /dev/shm and make it permanent in /etc/fstab:

$> mount -o remount,exec /dev/shm
So that it looks like:

$> mount |grep shm
     tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)

$> cat /etc/fstab |grep shm
     tmpfs /dev/shm tmpfs nosuid,nodev 0 0


6. Related Work and Other Observations


For further discussions on OracleJVM JAVA_JIT_ENABLED, see:
     What the heck are the /dev/shm/JOXSHM_EXT_x files on Linux?
     Oracle 19.4 OracleJVM JAVA_JIT_ENABLED Not Working on AIX

There is also contre-proposition:
     The Oracle Linux operating system must mount /dev/shm with the noexec option.

When JAVA_JIT_ENABLED works, and we strace MZ00 session for a while.
Trace file shows that MZ00 is continuously compiling Java, each time, only one method (not entire class).
The timestamp of /dev/shm/JOEZSHM_* shows that those memory files are also updated.

$> strace -tT -s 128 -o mz00_strace.txt -p 23037

-- (fd 6 is testdb_mz00_23037.trc)
$> grep "Done compiling.*locale*" mz00_strace.txt
  15:48:11 write(6, "Done compiling sun/util/locale/provider/TimeZoneNameUtility$TimeZoneNameGetter.getObject", 88) = 88 <0.000026>
  15:50:44 write(6, "Done compiling sun/util/locale/provider/TimeZoneNameUtility$TimeZoneNameGetter.getObject", 88) = 88 <0.000019>
  15:50:44 write(6, "Done compiling sun/util/locale/provider/TimeZoneNameProviderImpl.getDisplayNameArray", 84) = 84 <0.000025>
  15:50:44 write(6, "Done compiling sun/util/locale/provider/TimeZoneNameUtility$TimeZoneNameGetter.getName", 86) = 86 <0.000046>
  15:50:44 write(6, "Done compiling sun/util/locale/provider/LocaleResources.removeEmptyReferences", 77) = 77 <0.000035>
  15:50:45 write(6, "Done compiling sun/util/locale/provider/LocaleResources.getTimeZoneNames", 72) = 72 <0.000039>
  15:50:46 write(6, "Done compiling sun/util/locale/BaseLocale$Key.normalize", 55) = 55 <0.000030>
  15:50:47 write(6, "Done compiling sun/util/locale/provider/TimeZoneNameProviderImpl.getDisplayName", 79) = 79 <0.000057>
By the way, in the old Oracle release, there are many JIT compiled small files (KB) with name pattern like JOXSHM_EXT_*,
In Oracle 19c, there are only a few big files (16MB) with name pattern like JOEZSHM_*.

Update (2023-02-26)Oracle 19c Java Developer's Guide: 9.1 Oracle JVM Just-in-Time Compiler (JIT) documented this behaviour:
Note:
 
On Linux, Oracle JVM JIT uses POSIX shared memory that requires access to the /dev/shm directory.
The /dev/shm directory should be of type tmpfs and you must mount this directory as follows:
  -. With rw and execute permissions set on it
  -. Without noexec or nosuid set on it
If the correct mount options are not used, then the following failure may occur during installation of the database:
  ORA-29516: Aurora assertion failure: Assertion failure at joez.c:
             Bulk load of method java/lang/Object. failed; insufficient shm- object space

Sunday, December 4, 2022

One Test of Oracle too many parse errors with ORA-12850 and ORA-12872

In this Blog, we first make a test to demonstrate Oracle "too many parse errors" associated with ORA-12850 and ORA-12872.
Then we reveal both errors in Oracle trace files. It looks like a limitation of materialized WITH subquery in parallel statements.

Note: Reproduced in Oracle 19.9, 19.10, 19.11, 19.13, but not 19.17. All are non RAC DB.


1. Test Setup



create table test_tab as select level x from dual connect by level <= 10; 

create or replace procedure test_proc (p_cnt number, p_limit pls_integer, p_sleeps number) as
  type t_num_tab      is table of number;
  l_num_tab           t_num_tab := new t_num_tab();
begin
  for i in 1..p_cnt loop 
    dbms_session.sleep(p_sleeps);
    with sq as (select /*+ materialize  */ x from test_tab)
    select /*+ parallel(2) */ t.x  
      bulk collect into l_num_tab
      from test_tab t, sq 
     where rownum <= p_limit;
   end loop; 
   
   dbms_output.put_line('Selected Rows = '||l_num_tab.count);
end;
/


2. Test Run


(Step-1) Open a new Sqlplus session

(Step-2) Flush Shared Pool
            alter system flush shared_pool; 

(Step-3) Run test procedure
            exec test_proc(1000, 3, 0.1);

(Step-4) Check DB alert.log if there is message:
            WARNING: too many parse errors
   
(Step-5) If no such message, repeat Step-1 to Step-5.
         That means opening next new Sqlplus session to do the same test.
         Most of tests showed that one or two Sqlplus sessions are sufficient to reproduce the problem.
Here the message in DB alert.log. PARSE ERROR are signalled with error=12850 or error=12872.
Note that the message is only written for each 100 such errors (count=13300 jump to count=13400).

2022-12-01T20:01:08.221827+01:00
WARNING: too many parse errors, count=13300 SQL hash=0x49952973
PARSE ERROR: ospid=31115, error=12850 for statement: 
2022-12-01T20:01:08.221953+01:00
WITH SQ AS (SELECT /*+ materialize  */ X FROM TEST_TAB) SELECT /*+ parallel(2) */ T.X FROM TEST_TAB T, SQ WHERE ROWNUM <= :B1
Additional information: hd=0x97efe2e0 phd=0x9437c118 flg=0x110676 cisid=86 sid=86 ciuid=86 uid=86 sqlid=3z4qz1d4taabm
...Current username=K
...Application:  Action: 
2022-12-01T20:01:09.170414+01:00
WARNING: too many parse errors, count=13400 SQL hash=0x49952973
PARSE ERROR: ospid=31117, error=12872 for statement: 
2022-12-01T20:01:09.170531+01:00
WITH SQ AS (SELECT /*+ materialize  */ X FROM TEST_TAB) SELECT /*+ parallel(2) */ T.X FROM TEST_TAB T, SQ WHERE ROWNUM <= :B1
Additional information: hd=0x97efe2e0 phd=0x9437c118 flg=0x110676 cisid=86 sid=86 ciuid=86 uid=86 sqlid=3z4qz1d4taabm
...Current username=K
...Application: SQL*Plus Action:  
Appendix-1 lists a few related Oracle Error Documents.

Appendix-2 shows parallel xplan.


3. ORA-10388 Trace


Oracle (Doc ID 2180844.1) has a description of ORA-12850 (PX QC and worker signature-plan mismatch) and documents Troubleshooting Steps.

How to get plan signature tracing for ORA-12850 (Doc ID 2180844.1)
  Purpose
      ORA-12850 errors are usually caused by the worker getting a plan signature different from that of the QC.  
      These are typically, although not always, QRY OPTIMIZER bugs. This note provides the recommended tracing needed 
      in order to see if there is a signature-plan mismatch and diagnose the cause of ORA-12850.†

  Troubleshooting Steps
      alter system set†"_parallel_blackbox_size" =†524288;†
      -- prevents traces from being truncated
      alter session set max_dump_file_size = unlimited;                         
      -- customer will be prompted for a string that will appear in the tracefile name
      alter session set tracefile_identifier = '&TRACEFILE_IDENTIFIER';         
      alter session set events 'trace[SQL_Compiler.*] disk highest';
      alter session set events 'trace[Parallel_Execution.*] disk highest';
      
      --run test code, for example, in this Blog:
      exec test_proc(10, 3, 0.1);
      
      -- turn tracing off
      alter session set events 'trace[SQL_Compiler.*] off';
      alter session set events 'trace[Parallel_Execution.*] off';
    
    Look for the plan signatures in the QC vs. worker traces.  
    The worker that will have the mismatch should be the one with the 10388 error in it.  
    Notice that the plan signatures do NOT match, which causes the ORA-12850.
	
  3. Look for the plan signatures in the QC vs. worker traces.  
     The worker that will have the mismatch should be the one with the 10388 error in it.  
     Notice that the plan signatures do NOT match, which causes the ORA-12850.
     
        QC Plan Signature from Parse Msg:  SQL Plan Signature
        Slave Plan Signature:  SQL Plan Signature
Following above Oracle instructions, we get Plan Signature in worker (pnnn) traces like:

  Slave Plan Signature:  SQL Plan Signature (0x93838ff0):
  QC Plan Signature from Parse Msg:  SQL Plan Signature (0x7f0fed684134):
In Oracle Document, ORA-10388 Comment said: commands a slave to leave the server group (query failed).
So we will look for 10388 error in the trace files.

In worker (pnnn) trace files, internal oracle error=10388 (message to terminate a parallel query) is precised by "In signal-handler after catching 12850", and finalized by "Exiting parallel client parallel query execution(6) on error=10388".

====================== error 12850 ======================
2022-12-01 11:28:29.288*:PX_Control:kxfx.c@4849:kxfxsp1(begin): 
	Open & parse cursor(EXPRESS/-): 
	[123:WITH SQ AS (SELECT /*+ materialize  */ X FROM TEST_TAB) SELECT /*+ parallel(2) */ T.X FROM TEST_TAB T, SQ WHERE ROWNUM <= :B1 ]
2022-12-01 11:28:29.288*:PX_Control:kxfx.c@4853:kxfxsp1(end): 
2022-12-01 11:28:29.288*:PX_Control:kxfx.c@5885:kxfxsp(end): 
2022-12-01 11:28:29.327*:PX_Control:kxfx.c@6114:kxfxsStatus(): 
	joining FALSE ParseState KXFXUPARSE2 (3)
2022-12-01 11:28:29.331*:PX_Control:kxfx.c@10728:kxfxmai(): 
	In signal-handler after catching 12850
2022-12-01 11:28:29.333*:PX_Control:kxfx.c@10728:kxfxmai(): 
	In signal-handler after catching 10388
2022-12-01 11:28:29.333*:PX_Control:kxfx.c@5678:kxfxsuf(): 
	kxfx slave finish
2022-12-01 11:28:29.333*:PX_Messaging:kxfp.c@14665:kxfprdp_int(): 
	Exiting parallel client parallel query execution(6) on error=10388
In the same trace files, "Received error 12872" is noted with "First parse failed" during Semantic Analysis:

====================== error 12872 ======================
2022-12-01 11:29:30.411*:PX_Control:kxfx.c@6829:kxfxsExecute(): 
	Received error 12872 from first slave parse.
	First parse failed. On to outlined parse 
2022-12-01 11:29:30.411*:PX_Control:kxfx.c@13917:kxfxsStmtDump(): 
	kxfxsSmtExecute stmt:0x7f0fed69ff50
	 UGA: InUse:255KB Alloc:255KB
	 PGA: InUse:1665KB Alloc:1835KB
Enabling tracing for cur#=1 sqlid=3z4qz1d4taabm top-level
Parsing cur#=1 sqlid=3z4qz1d4taabm len=123 
sql=WITH SQ AS (SELECT /*+ materialize  */ X FROM TEST_TAB) SELECT /*+ parallel(2) */ T.X FROM TEST_TAB T, SQ WHERE ROWNUM <= :B1 
End parsing of cur#=1 sqlid=3z4qz1d4taabm
Semantic Analysis cur#=1 sqlid=3z4qz1d4taabm
OPTIMIZER INFORMATION


4. Performance Impacts


When multiple sessions frequently run such statement, especially with high degree of parallelism (DOP), DB performance degradation is often caused by following heavy wait events:

  cursor: pin S wait on X
  library cache lock
  kksfbc child completion
  PX Deq: Parse Reply
  PX Deq: Execution Msg
When looking the wait chains, often one PX particular worker is the blocker of QC or other workers. We also noticed that V$SQL.child_number can go over 1000 ("_cursor_obsolete_threshold" default 1024).

A quick workaround is to remove parallel hint to make serial execution.

Blog: 12c Parse listed a few activity stats to monitor:
  opened cursors cumulative         
  enqueue requests                  
  enqueue releases                  
  sql area purged                   
  sql area evicted                  
  parse count (total)               
  parse count (hard)                
  parse count (failures)            
Tests showed that the call (1000 executions of query):

  exec test_proc(1000, 3, 0.1); 
generated 6000 lines of "too many parse errors" in DB alert.log:
(the count seems reading from V$SQLAREA.invalidations for each sql_id)

  WARNING: too many parse errors, count=53700 SQL hash=0x49952973 ... sqlid=3z4qz1d4taabm
  ...
  WARNING: too many parse errors, count=59700 SQL hash=0x49952973 ... sqlid=3z4qz1d4taabm
v$sysstat query:

select v.value, v.* from v$sysstat v
where name in ('parse count (failures)', 'sql area purged', 'sql area evicted', 
               'parse count (hard)', 'parse count (total)', 
               'opened cursors cumulative', 'enqueue releases', 'enqueue requests') 
order by v.value;
shwed that:
   the value of 'parse count (failures)', 'sql area purged', 'sql area evicted' increased 6000
   the value of 'parse count (hard)', 'parse count (total)' increased a little more than 6000 (if no other DB activities)
So each test query execution created 6 "too many parse errors" related to 6 'parse count (failures)'.


5. ORA-00600 [xplUnCompact:magic]


In very rare case, QC session throws ORA-00600 as follows. It looks like that the error occurs when optimizer automatic sql reoptimisation (kkoarSalvageStats, COMPONENT: SQL_Costing) checks cardinality feedback (kkocfbCheckCardEstRws, COMPONENT: SQL_Costing) in xplan (xplUnCompact, COMPONENT: explain).

ORA-00600: internal error code, arguments: [xplUnCompact:magic], [0], [0x7FFE5C370D98]

----- Call Stack Trace -----

[12] (kgeasnmierr()+146 -> kgerinv())
[13] (xplUnCompact()+1320 -> kgeasnmierr())
[14] (kkocfbCheckCardEstRws()+815 -> xplUnCompact())
[15] (kkocfbCheckCardEstRws()+259 -> kkocfbCheckCardEstRws())
[16] (kkocfbCheckCardEstRws()+259 -> kkocfbCheckCardEstRws())
     -- a couple of same calling lines
[35] (kkocfbCheckCardEstRws()+259 -> kkocfbCheckCardEstRws())
[36] (kkocfbCheckCardEst()+571 -> kkocfbCheckCardEstRws())
[37] (kkoarSalvageStats()+910 -> kkocfbCheckCardEst())
[38] (kxsFreeWorkArea()+149 -> kkoarSalvageStats())
[39] (kxsFreeExecutionHeap()+60 -> kxsFreeWorkArea())
[40] (kksumc()+631 -> kxsFreeExecutionHeap())
[41] (opiexe()+10228 -> kksumc())

Session Wait History:
    elapsed time of 0.993485 sec since last wait
 0: waited for 'enq: PS - contention'


6. Related Oracle BUGs and Related Work


There are a few related BUGs in Oracle MOS, we tested them (patch installed or hidden parameter set. Test DB are not RAC).
All have no effect, and DB still throws the same "too many parse errors":

Bug 31602782 ††Contention on "CURSOR: PIN S WAIT ON X" when PQ slave's execution plan does not match with QC       
   Versions confirmed as being affected: 12.2.0.1 (Base Release) - 19.13.0
   Description
     When a parallel query is executed, QC sends parse message to all workers in all  
     instances. In a instance one worker tries to parse the cursor and other workers 
     waits (cursor pin s wait for x) for that cursor to parsed and built. 
      
     If this worker could not build the cursor with the same plan as QC and could not  
     join, another worker tries the same and so on until all workers get exhausted 
     and query runs in serial or without workers from that instance. 
      
     This can cause a huge contention on CURSOR: PIN S WAIT ON X .
      
     After this bug fix, if a worker on one instance fails to produce same plan as QC, 
     other workers on the same instance do not retry the parse and workers from 
     that instance are not allocted.
    
Bug 32121673††A Process May Crash With Ora-600 [xpluncompact:magic]
 
Warning: Too Many Parse Errors With ORA-12850/ORA-12872 (Doc ID 2816169.1)
  Cause
     It appears that this is due to Bug 31877812 : WHEN HIGH CONCURRENCY TOO MUCH CPU UTILIZATION AND PARALLEL 
     QUERY IS FAILING WITH ORA-12872, this bug is currently waiting on two other bugs to be resolved first. 
  Solution
    †Bug is not resolved as of yet but what has been able to get by the issue is setting the following parameter:
       alter system set "_nlj_batching_enabled"=0;
  
RAC: ORA-12801 and ORA-12850: Could Not Allocate Slaves On All Specified Instances: 2 Needed, 1 (Doc ID 1503358.1)
Blog: ORA-12850: Could not allocate slaves on all specified instances error and a workaround for my scripts talked about GV$ views for reading cluster-wide ASH data and PX slaves used for fetching remote data due to Xplan temp table transformations of materialized WITH subqueries, and provided workaround of "INLINE" hint in some TPT scripts.

The ORA-12850 indicates the limitation of materialized WITH subquery in Oracle 19c RAC installations because of GV$ view access.
This Blog test shows the limitation of materialized WITH subquery in parallel statements (non RAC).

Blog: Oracle WITH subquery_factoring_clause TEMP TABLE TRANSFORMATION and log file sync talked about "INLINE" hint in materialized WITH subqueries and log file sync.


Appendix-1. Oracle Error Document


ORA-12850, "Could not allocate slaves on all specified instances: %s needed, %s allocate d"
  *Cause: When executing a query on a gv$ fixed view, one or more                                                  
          instances failed to allocate a slave to process query.                                                   
  *Action:Check trace output for instances on which slaves failed to start.                                        
          GV$ query can only proceed if slaves can be allocated on all                                             
          instances.  
          
ORA-12872, "First slave parse gave different plan"                                                               
  *Cause: First hard parse on slave given QC-supplied environment and                                              
          parameters gave different plan from QC. Try again with outline.                                          
  *Action: No external action. Internally used for outline-based reparse.                                          

ORA-12842, "Cursor invalidated during parallel execution"
  *Cause:  The cursor was invalidated during the parse phase of deferred
           parallel processing, e.g. when set operands are parallelized.
  *Action: Depends on why the cursor was invalidated. Possible causes include
           DDL on a schema object and shared pool being flushed.

ORA-10388, "parallel query server interrupt (failure)"
  *Cause: internal use only
  *Action: this event should never be set externally
  *Comment: commands a slave to leave the server group (query failed)

  The ORA-10388 is an internal oracle message to terminate a parallel query when the operation is failed.
  This message should not be visible in alert.log

ORA-10387, "parallel query server interrupt (normal)"
  *Cause: internal use only
  *Action: this event should never be set externally
  *Comment: commands a slave to leave the server group normally

ORA-10382,  "parallel query server interrupt (reset)"            
  *Cause: internal use only                                       
  *Action: this event should never be set externally              
  *Comment: commands a slave to clear its buffers and reset itself


Appendix-2. Xplan



SQL_ID  3z4qz1d4taabm, child number 0
-------------------------------------
WITH SQ AS (SELECT /*+ materialize  */ X FROM TEST_TAB) SELECT /*+ 
parallel(2) */ T.X FROM TEST_TAB T, SQ WHERE ROWNUM <= :B1
 
Plan hash value: 1215440870
 
---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                        | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                             |       |       |   170 (100)|          |        |      |            |
|   1 |  TEMP TABLE TRANSFORMATION             |                             |       |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                       |                             |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)                 | :TQ10000                    |   327 |  4251 |     2   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     LOAD AS SELECT (TEMP SEGMENT MERGE)| SYS_TEMP_0FD9D664E_DBF2D292 |       |       |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR                 |                             |   327 |  4251 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS FULL                | TEST_TAB                    |   327 |  4251 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  7 |   COUNT STOPKEY                        |                             |       |       |            |          |        |      |            |
|   8 |    PX COORDINATOR                      |                             |       |       |            |          |        |      |            |
|   9 |     PX SEND QC (RANDOM)                | :TQ20001                    |   106K|  1357K|   168   (0)| 00:00:01 |  Q2,01 | P->S | QC (RAND)  |
|* 10 |      COUNT STOPKEY                     |                             |       |       |            |          |  Q2,01 | PCWC |            |
|  11 |       MERGE JOIN CARTESIAN             |                             |   106K|  1357K|   168   (0)| 00:00:01 |  Q2,01 | PCWP |            |
|  12 |        PX BLOCK ITERATOR               |                             |   327 |  4251 |     2   (0)| 00:00:01 |  Q2,01 | PCWC |            |
|* 13 |         TABLE ACCESS FULL              | TEST_TAB                    |   327 |  4251 |     2   (0)| 00:00:01 |  Q2,01 | PCWP |            |
|  14 |        BUFFER SORT                     |                             |   327 |       |   166   (0)| 00:00:01 |  Q2,01 | PCWP |            |
|  15 |         PX RECEIVE                     |                             |   327 |       |            |          |  Q2,01 | PCWP |            |
|  16 |          PX SEND BROADCAST             | :TQ20000                    |   327 |       |            |          |  Q2,00 | P->P | BROADCAST  |
|  17 |           VIEW                         |                             |   327 |       |            |          |  Q2,00 | PCWP |            |
|  18 |            PX BLOCK ITERATOR           |                             |   327 |  4251 |     2   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 19 |             TABLE ACCESS FULL          | SYS_TEMP_0FD9D664E_DBF2D292 |   327 |  4251 |     2   (0)| 00:00:01 |  Q2,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
 
   1 - SEL$2       
   2 - SEL$1       
   6 - SEL$1        / TEST_TAB@SEL$1
  13 - SEL$2        / T@SEL$2
  17 - SEL$D67CB2D2 / SQ@SEL$2
  19 - SEL$D67CB2D2 / T1@SEL$D67CB2D2
 
Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      OPT_PARAM('optimizer_dynamic_sampling' 1)
      OPT_PARAM('_optimizer_use_feedback' 'false')
      OPT_PARAM('_optimizer_gather_stats_on_load' 'false')
      OPT_PARAM('_px_adaptive_dist_method' 'off')
      OPT_PARAM('_optimizer_strans_adaptive_pruning' 'false')
      OPT_PARAM('_optimizer_nlj_hj_adaptive_join' 'false')
      OPT_PARAM('optimizer_index_cost_adj' 50)
      OPT_PARAM('optimizer_index_caching' 95)
      OPT_PARAM('_fix_control' '23473108:0 20424684:0 22582700:0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$D67CB2D2")
      MATERIALIZE(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      FULL(@"SEL$2" "T"@"SEL$2")
      NO_ACCESS(@"SEL$2" "SQ"@"SEL$2")
      LEADING(@"SEL$2" "T"@"SEL$2" "SQ"@"SEL$2")
      USE_MERGE_CARTESIAN(@"SEL$2" "SQ"@"SEL$2")
      PQ_DISTRIBUTE(@"SEL$2" "SQ"@"SEL$2" NONE BROADCAST)
      FULL(@"SEL$D67CB2D2" "T1"@"SEL$D67CB2D2")
      FULL(@"SEL$1" "TEST_TAB"@"SEL$1")
      END_OUTLINE_DATA
  */
Peeked Binds (identified by position):
--------------------------------------
   1 - :B1 (NUMBER): 3
 
Predicate Information (identified by operation id):
---------------------------------------------------
   6 - access(:Z>=:Z AND :Z<=:Z)
   7 - filter(ROWNUM<=:B1)
  10 - filter(ROWNUM<=:B1)
  13 - access(:Z>=:Z AND :Z<=:Z)
  19 - access(:Z>=:Z AND :Z<=:Z)
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  parallel(2)
 
   2 -  SEL$1
           -  materialize
Note
-----
   - Degree of Parallelism is 2 because of hint