We start the test by (see appended Test Code):
begin
ksun_mutex_test_jobs(64, 2);
end;
which launches 64 sessions running ksun_mutex_testp1 and 2 ksun_mutex_testp2.
ksun_mutex_testp1 runs the same sql statement by varying the NLS settings.
ksun_mutex_testp2 runs a query over v$sql in order to simulate some monitoring applications on sql statements. v$sql is chosen also because it is derived from x$kglcursor_child, which means any select on it will touch all child_cursors.
ksun_mutex_testp2 accelerates the collision, but does not play a decisive role in the test.
Here is the observed "particles":
1. Number of child_cursors
In theory, ksun_mutex_testp1 generates at most 800 (40 nls_language multiplied 20 nls_territory) child_cursors for the statement (sql_id: '754r1k9db5u80'):
select id into l_id from testt where name = :B1;
but
select count(*) from v$sql where sql_id ='754r1k9db5u80';
>>> 66'510
(with certain fluctuations, but more than 10'000)
10Jan2013 Update: Tested on new Oracle 11.2.0.3.0,
select count(*), min(child_number), max(child_number) from v$sql where sql_id ='754r1k9db5u80';
=> 23'027 0 99
count(*) can no more reach 66'510.
select count(*), min(child_number), max(child_number) from v$sql where sql_id ='754r1k9db5u80' and is_obsolete = 'N';
=> 78 0 77
It seems that "_cursor_obsolete_threshold" is active.
2. Shared memory
select sharable_mem, persistent_mem, runtime_mem from v$sqlarea where sql_id ='754r1k9db5u80';
>>> 845'643'351 292'590'880 226'019'744
3. Reason for child_cursors
select extract (xmltype ('<kroot>' || reason || '</kroot>'), '//ChildNumber').getStringVal () child
,extract (xmltype ('<kroot>' || reason || '</kroot>'), '//reason').getStringVal () reason
from v$sql_shared_cursor
where sql_id ='754r1k9db5u80';
>>> <ChildNumber>x</ChildNumber> <reason>NLS Settings(0)</reason>
where x is the child_cursor number.
4. New MUTEX_TYPE: "hash table"
select * from v$mutex_sleep where mutex_type = 'hash table' order by sleeps;
>>> hash table kkshGetNextChild [KKSHBKLOC1] 1289912 0
I wonder if this is related to db_block_hash_buckets subpool in shared pool:
select * from v$sgastat where pool = 'shared pool' and name = 'db_block_hash_buckets';
>>> shared pool db_block_hash_buckets 2920448
Update in 18Mar2013:
db_block_hash_buckets is for database block hash buckets. It is allocated in shared pool.
It takes about 1% of db_cache_size for db_block_size = 8192, or 70 Bytes for each database block hash bucket (chain).
For example, a DB with db_cache_size=262GB, db_block_hash_buckets needs 2848MB in shared pool.
5. System frozen
The most astonishing detection is that all started sessions are blocked after dozens of hours or even a couple of days by one session which is waiting either for an event:
"library cache: mutex X"
or
"library cache lock".
The waiters can be either "library cache: mutex X" or "library cache lock".
By dumping systemstate for the case of final blocker: library cache lock, we can see:
sid: 446 ser: 2025
Current Wait Stack:
Not in wait; last wait ended 1118 min 2 sec ago
There are 65 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 42, ser: 1249
wait event: 'library cache lock'
p1: 'handle address'=0xde5d11a8
p2: 'lock address'=0xe019ed68
p3: '100*mode+namespace'=0x520002
sid: 42 ser: 1249
Current Wait Stack:
0: waiting for 'library cache lock'
handle address=0xde5d11a8, lock address=0xe019ed68, 100*mode+namespace=0x520002
wait_id=78226 seq_num=20283 snap_id=347
wait times: snap=51.136124 sec, exc=1118 min 2 sec, total=1118 min 2 sec
wait times: max=infinite, heur=1118 min 2 sec
wait counts: calls=22362 os=22362
in_wait=1 iflags=0x15a2
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 446, ser: 2025
v$locks lists that final blocker session 446 holds 3 locks of type: CU, AE, JQ, and does not request any locks. Further check shows that only this session holds one CU lock. CU means "Recovers cursors in case of death while compiling", so this final blocker is probably falling into an endless loop of cursor rescuing process.
If one tries to deblock the situation by killing the blocking session, one of the blocked sessions will take over its role, and the original blocking session joins to the blocked sessions. So the only fix could be to reboot the DB.
On the UNIX level, we can see this process is more than 95% on Usr CPU.
As checked from time to time, the system is getting gradually slower and finally totally hanging on one single session.
The whole tests were carried out on Solaris(x86-64) and AIX. Similar blocking behaviors were also reported by some real Oracle 11gR2 applications.
Probably these "particles" will not need 45-year to be confirmed by Oracle.
By the way, with Oracle 11.2.0.3.0, a new parameter is introduced:
Name: _cursor_obsolete_threshold
Description: Number of cursors per parent before obsoletion.
Default value: 100
Addendum (2015.03.25): Looking at Oracle 11.2.0.4.0, the default value is increased to 1024.
It seems Oralce does not strictly follow this threshold, and eventually can cause Shared Pool explosion due to number of child cursor versions.
Test Code (Oralce 11.2.0.3.0):
drop table testt;
create table testt as select 1 id, 'ksun' name from dual;
create table ksun_mutex_sleep as select * from v$mutex_sleep where 1=2;
create or replace procedure ksun_mutex_testp1 as
l_id number;
l_name varchar2(10) := 'ksun';
type tab is table of varchar2(100);
l_tab_lang tab;
l_tab_terr tab;
begin
select value bulk collect into l_tab_lang from v$nls_valid_values where parameter='LANGUAGE';
select value bulk collect into l_tab_terr from v$nls_valid_values where parameter='TERRITORY';
for j in 1..least(40, l_tab_lang.count) loop
--execute immediate q'[alter session set nls_language=']'||l_tab_lang(j)||q'[']';
dbms_session.set_nls('nls_language', ''''||l_tab_lang(j)||'''');
for k in 1..least(20, l_tab_terr.count) loop
--execute immediate q'[alter session set nls_territory=']'||l_tab_terr(k)||q'[']';
dbms_session.set_nls('nls_territory', ''''||l_tab_terr(k)||'''');
select id into l_id from testt where name = l_name;
end loop;
end loop;
end;
/
create or replace procedure ksun_mutex_testp2 as
begin
for c in (select * from v$sql) loop
null;
end loop;
end;
/
create or replace procedure ksun_mutex_test_jobs(p_job_cnt1 number, p_job_cnt2 number)
as
l_job_id pls_integer;
begin
for i in 1.. p_job_cnt1 loop
dbms_job.submit(l_job_id, 'ksun_mutex_testp1;', interval => 'sysdate');
end loop;
for i in 1.. p_job_cnt2 loop
dbms_job.submit(l_job_id, 'ksun_mutex_testp2;', interval => 'sysdate');
end loop;
commit;
end;
/