Thursday, August 27, 2015

Latch _SPIN_COUNT: adaptive

The pseudo-code on page 241 of Expert Oracle Database Architecture (3rd Edition), Section: Latch "Spinning", shows that each latch sleep occurs only after 2000 get requests. According to that program logic, the query:

select snap_id, latch_name, gets, misses, sleeps, spin_gets, wait_time
      ,round(gets/sleeps)               gets_per_sleep
      ,((sleeps + spin_gets) - misses)  delta
      ,round(wait_time/sleeps)          wait_time_per_sleep
from
  (select snap_id, latch_name
         ,gets - lag(gets) over (partition by latch_name order by snap_id) gets
         ,misses - lag(misses) over (partition by latch_name order by snap_id) misses
         ,sleeps - lag(sleeps) over (partition by latch_name order by snap_id) sleeps
         ,spin_gets - lag(spin_gets) over (partition by latch_name order by snap_id) spin_gets
         ,wait_time - lag(wait_time) over (partition by latch_name order by snap_id) wait_time
   from   dba_hist_latch
   where latch_name = 'shared pool'
  )
where sleeps > 0
order by snap_id, latch_name;


should return gets_per_sleep >= 2000.

(The column delta is discussed in the blog: Is latch misses statistic gathered or deduced?)

However, when running the test code "instest.java.NOBIND" provided by the book, we observe rows whose gets_per_sleep is around 200 for "latch: shared pool", much less than 2000, and also rows with values bigger than 20,000.

 Latch Name     Requests  Misses  Sleeps  Gets (Spin Gets)
 -------------- --------  ------- ------- ----------------
 shared pool    2,296,041 75,240  15,267  60,165


The arithmetic below confirms the observation:

 gets_per_sleep = 2,296,041 / 15,267         = 150
 delta          = (15,267 + 60,165) - 75,240 = 192
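The arithmetic above can be reproduced directly from the AWR latch figures:

```python
# Figures taken from the AWR "Latch Sleep Breakdown" above.
gets, misses, sleeps, spin_gets = 2_296_041, 75_240, 15_267, 60_165

gets_per_sleep = round(gets / sleeps)      # gets between two latch sleeps
delta = (sleeps + spin_gets) - misses      # would be ~0 if every miss ended
                                           # in exactly one spin get or sleep
print(gets_per_sleep)  # 150
print(delta)           # 192
```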


The pattern seems to be that gets_per_sleep decreases as hard parsing increases.

It looks like Oracle tends to make performance more and more adaptive, what might be called, in my humble opinion, system self-tuning or self-learning. Probably it uses some adaptive control method, which adapts "_SPIN_COUNT" based on feedback from the latches' workload.

MOS Note (Doc ID 1970450.1) on "_SPIN_COUNT" (amount to spin waiting for a latch) says:

The default value of the parameter is automatically adjusted when the machine's CPU count changes provided that the default was used. If the parameter was explicitly set, then there is no change. It is not usually recommended to change the default value for this parameter.

If after this entire loop, the latch is still not available, the process must yield the CPU and go to sleep. Initially, it sleeps for one centisecond. This time is doubled in every subsequent sleep.

gets_per_sleep < 2000 could also indicate that Oracle internally halves (or otherwise reduces) "_SPIN_COUNT" when it doubles the sleep time.
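The sleep doubling described in the MOS note can be sketched as a toy model (the spin count of 2000 matches the documented default; the model itself is an illustration, not Oracle's actual code):

```python
def latch_get_sleeps(spins_needed, spin_count=2000):
    """Toy model of latch acquisition: spin up to spin_count times, then
    sleep; the sleep time doubles after every unsuccessful round (per the
    MOS note). Returns the list of sleep durations in centiseconds."""
    sleeps, sleep_cs, spins = [], 1, 0
    while spins < spins_needed:
        spins += spin_count
        if spins >= spins_needed:
            break                  # latch obtained while spinning
        sleeps.append(sleep_cs)    # yield the CPU and sleep
        sleep_cs *= 2              # exponential backoff
    return sleeps

print(latch_get_sleeps(7000))  # [1, 2, 4] -> three sleeps before success
```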

2000 is the default value of "_SPIN_COUNT".

Changing it with:
 alter system set "_SPIN_COUNT"=4000;
seems to have no influence on the above latch statistics.

Wednesday, July 8, 2015

Redo Practice

Although "Undo is more complicated than redo" (Oracle Core: Essential Internals for DBAs and Developers), redo performance is more visible to applications. The end user is affected by "log file sync", while the DBA googles for a panacea for "log file parallel write". Redo travels through all layers of the computer system: from applications, through Oracle, the OS (scheduler, VM, file system) and the adapter, down to the disks (LUNs, RAID) and the network.

Following the previous blog, "UNDO Practice", it is time to do the second exercise, sticking to "The Beginners' Guide to Becoming an Oracle Expert" (Page 5).

All tests are done in Oracle 11.2.0.4.0 (see appended Test Code).

1. Asynchronous Commit


The default PL/SQL commit behavior for nondistributed transactions is BATCH NOWAIT if the COMMIT_LOGGING and COMMIT_WAIT database initialization parameters have not been set (Database PL/SQL Language Reference).

Run:

 exec create_awr;
 exec update_test_tab(1000, 1);
 exec create_awr;


AWR shows:

Statistic / Event              Total / Waits   per Second   per Trans
redo size                         11,369,340   943,435.40   11,245.64
user commits                           1,011        83.89        1.00
redo synch writes                          2         0.17        0.00
redo writes                              971        80.57        0.96
log file sync                              2
log file parallel write                  972

Table-1

For 1000 updates and 1000 user commits, Oracle issues 971 "redo writes", which trigger almost the same number of "log file parallel write" waits (972), very close to the number of user commits. However, there are only 2 "redo synch writes", and hence 2 "log file sync" waits. That is the effect of asynchronous commit.

Note that "select for update" shows similar redo behaviour to "update", even though there is no real update (see Test Code 2).

This redo optimization is also effective for the Oracle server-side internal driver JVM.

2. Synchronous Commit


Run "script_1 1000", AWR shows:


Statistic / Event              Total / Waits   per Second   per Trans
redo size                         11,202,868 3,002,644.87   11,080.98
user commits                           1,011       270.97        1.00
redo synch writes                      1,001       268.29        0.99
redo writes                            1,036       277.67        1.02
log file sync                          1,001
log file parallel write                1,034

Table-2

In the case of 1000 synchronous commits, all 5 statistics (not counting "redo size") are almost the same: each user commit leads to one of each event.

"redo synch write" represents the number of times the redo is forced to disk immediately, usually for a transaction commit.

A client-side JDBC connection has similar behaviour, but it is possible to switch off the default auto-commit with:
   connection.setAutoCommit(false);

3. Piggybacked Commit


Piggybacked commit reduces the number of redo writes caused by commits by grouping redo records from several sessions together (Oracle MOS WAITEVENT: "log file sync" Reference Note (Doc ID 34592.1)).

Run:

 exec create_awr;
 exec update_test_tab_loop(1000, 10);
 exec dbms_lock.sleep(120);
 -- wait for the jobs to finish
 exec create_awr;


AWR shows:


Statistic / Event              Total / Waits   per Second   per Trans
redo size                         18,138,376   136,033.06    1,808.05
user commits                          10,032        75.24        1.00
redo synch writes                         12         0.09        0.00
redo writes                              440         3.30        0.04
log file sync                             12
log file parallel write                  439

Table-3

Compared to Table-1, although we make 10,000 updates and user commits, "redo size" is not 10 times that of Table-1 (18,138,376 vs. 11,369,340), and "redo writes" (likewise "log file parallel write") are even fewer than in Table-1 (440 vs. 971). That is the role piggybacked commit plays.
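The effect can be quantified from the two tables (figures copied from Table-1 and Table-3 above):

```python
# Table-1: 1,000 commits; Table-3: 10,000 commits with job concurrency.
t1_redo_size, t1_redo_writes = 11_369_340, 971
t3_redo_size, t3_redo_writes = 18_138_376, 440

size_ratio = round(t3_redo_size / t1_redo_size, 2)       # redo size growth
commits_per_write_t1 = round(1_011 / t1_redo_writes, 1)  # ~1 commit/write
commits_per_write_t3 = round(10_032 / t3_redo_writes, 1) # many commits/write

print(size_ratio, commits_per_write_t1, commits_per_write_t3)  # 1.6 1.0 22.8
```

So with 10 times the commits, redo size grows only ~1.6x, and on average ~23 commits share one redo write.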

LGWR truss outputs:

  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000
  aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
  thread_post_many(7, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0
  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000
  aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
  thread_post_many(4, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0
  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000
  aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
  thread_post_many(6, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0

We can see LGWR calling "thread_post_many" for several sessions (nthreads is the first parameter of "thread_post_many") after each "aio_nwait64". It means that the redo of multiple threads (Oracle sessions) is written by one single "listio64", and the sessions are posted at the same time.

4. Distributed Transactions


We will look at Oracle-Controlled Distributed Transactions (using Database Link).

Run "script_2 1000", AWR in dblocal (Commit Point Site) shows:


Statistic / Event                 Total / Waits   per Second   per Trans
redo size                            12,135,636   394,513.70   12,003.60
user commits                              1,011        32.87        1.00
redo synch writes                         3,003        97.62        2.97
redo writes                               3,038        98.76        3.00
log file sync                             3,002
log file parallel write                   3,116
transaction branch allocation             4,012

Table-4


AWR in dbremote shows:

Statistic / Event                 Total / Waits   per Second   per Trans
redo size                            11,315,104   371,681.63   11,191.99
user commits                              1,011        33.21        1.00
redo synch writes                         2,003        65.80        1.98
redo writes                               2,035        66.85        2.01
log file sync                             2,002
log file parallel write                   2,036
transaction branch allocation             9,013

Table-5

Compared to Table-2 (1000 synchronous commits in nondistributed transactions), 1000 updates and 1000 user commits in distributed transactions demand three times as many redo events at the commit point site, and twice as many at the other node. "transaction branch allocation" is four times, respectively nine times, the number of commits.
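These multipliers can be read off directly from the per-commit figures in Table-4 and Table-5:

```python
commits = 1_011
# Commit point site (Table-4) vs. remote node (Table-5).
local = {"redo synch writes": 3_003, "redo writes": 3_038,
         "transaction branch allocation": 4_012}
remote = {"redo synch writes": 2_003, "redo writes": 2_035,
          "transaction branch allocation": 9_013}

for name in local:
    # events per user commit at each node, rounded to whole multipliers
    print(name, round(local[name] / commits), round(remote[name] / commits))
# redo synch writes 3 2
# redo writes 3 2
# transaction branch allocation 4 9
```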

Each "redo synch write" is translated into one UNIX system call, "pwrite", which can be watched with the truss or dtrace command on Solaris. Therefore there are 3,003 pwrites in dblocal and 2,003 pwrites in dbremote.

By dumping Redo Log only for the "commit" command in dblocal and dbremote with:
     ALTER SYSTEM DUMP LOGFILE '<full_path_logfile_name>';

CHANGE vectors in dblocal are:

CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8a SEQ:1 OP:5.2 
CLS:118 AFN:3 DBA:0x00c00550 OBJ:4294967295 SCN:0x0853.9628ab8a SEQ:1 OP:5.1 
CLS:118 AFN:3 DBA:0x00c00550 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:1 OP:5.1 
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:1 OP:5.12 
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:2 OP:5.12 
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8f SEQ:1 OP:5.4 

CHANGE vectors in dbremote are:

CLS:123 AFN:3 DBA:0x00c002b0 OBJ:4294967295 SCN:0x0853.9628ac1d SEQ:1 OP:5.2
CLS:124 AFN:3 DBA:0x00c009ba OBJ:4294967295 SCN:0x0853.9628ac5c SEQ:1 OP:5.1
CLS:124 AFN:3 DBA:0x00c009ba OBJ:4294967295 SCN:0x0853.9628ac6a SEQ:1 OP:5.1
CLS:123 AFN:3 DBA:0x00c002b0 OBJ:4294967295 SCN:0x0853.9628ac6a SEQ:1 OP:5.4

we can see that "commit" in dblocal includes two additional "OP:5.12" change vectors on the UNDO header (CLS:117), apart from OP:5.2 (update of the transaction table in the undo segment header), OP:5.1 (update of the undo block), and OP:5.4 (commit). Probably "OP:5.12" is specific to the two-phase commit.

As documented in the Oracle two-phase commit mechanism, the commit point site goes through all 3 phases (prepare, commit, and forget), whereas the other resource nodes perform only the first two.

A redo record, also called a redo entry, is made up of a group of change vectors. In AWR report, we can also check "redo entries".

Each committed transaction has an associated system change number (SCN) to uniquely identify the changes made by the SQL statements within that transaction.

During the prepare phase, the database determines the highest SCN at all nodes involved in the transaction. The transaction then commits with the high SCN at the commit point site. The commit SCN is then sent to all prepared nodes with the commit decision (Distributed Transactions Concepts).
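The SCN coordination described above can be sketched as a toy model (node names and SCN values are hypothetical; this is not Oracle's implementation):

```python
def two_phase_commit_scn(node_scns):
    """Toy model of 2PC SCN coordination: during the prepare phase the
    coordinator collects each node's SCN; the transaction then commits
    with the highest of them, which is sent to all prepared nodes
    together with the commit decision."""
    commit_scn = max(node_scns.values())   # highest SCN across all nodes
    return {node: commit_scn for node in node_scns}

# hypothetical SCNs at two nodes before the prepare phase
print(two_phase_commit_scn({"dblocal": 1_000, "dbremote": 1_200}))
# {'dblocal': 1200, 'dbremote': 1200}
```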

The alert log of dblocal contains the following text the first time the distributed transaction is made:

 Fri Jul 03 13:48:00 2015
 Advanced SCN by 29394 minutes worth to 0x0824.738bd480, by distributed transaction logon, remote DB: DBREMOTE.COM.
  Client info: DB logon user TESTU, machine dbremote, program oracle@dbremote (TNS V1-V3), and OS user oracle
 Fri Jul 03 13:50:04 2015 

The following two queries can help monitor distributed transactions.

 select 'local' db, v.* from v$global_transaction v  union all
 select 'remote' db, v.* from v$global_transaction@dblinkremote v;
 
 select 'local' db, v.* from v$lock v where type in ('TM', 'TX', 'DX')  union all
 select 'remote' db, v.* from v$lock@dblinkremote v where type in ('TM', 'TX', 'DX');

For XA-based distributed transactions (X/Open DTP), similar behaviour can be observed.
Additionally, the AWR Enqueue Activity section shows "DX-Distributed Transaction".

5. Distributed Transaction Commit


We can commit the whole distributed transaction in the local DB:
   update test_redo set name = 'update_local_1' where id = 1;
   update test_redo@dblinkremote set name = 'update_remote_1' where id = 1;
   commit;

or commit in the remote DB:
   update test_redo set name = 'update_local_1' where id = 1;
   update test_redo@dblinkremote set name = 'update_remote_1' where id = 1;
   exec dblinkremote_commit;


6. Distributed Transaction with autonomous_transaction


We can make a more sophisticated test by incorporating an autonomous_transaction in the remote DB.
Run "script_3 1000"; AWR in dblocal (the commit point site) shows:

Statistic / Event                 Total / Waits   per Second   per Trans
redo size                            12,010,796   483,565.34   11,880.11
user commits                              1,011        40.70        1.00
redo synch writes                         3,004       120.94        2.97
redo writes                               3,041       122.43        3.01
log file sync                             3,004
log file parallel write                   3,042
transaction branch allocation             6,010

Table-6

AWR in dbremote shows:

Statistic / Event                 Total / Waits   per Second   per Trans
redo size                            12,156,052   499,673.30    6,047.79
user commits                              2,010        82.62        1.00
redo synch writes                         3,002       123.40        1.49
redo writes                               3,035       124.75        1.51
log file sync                             3,004
log file parallel write                   3,034
transaction branch allocation            10,019

Table-7

In dblocal, Table-4 and Table-6 are almost identical.

In dbremote, the numbers in Table-7 are higher than in Table-5 due to the autonomous_transaction. It looks as if no piggybacked commit is in effect.

The above test case reflects the complexity of a three-tier architecture, where the middle tier and the data tier synchronize atomic data operations with distributed transactions. With Oracle as the data tier, this can involve autonomous_transactions.

7. Distributed Transaction: distributed_lock_timeout


In the remote DB, set distributed_lock_timeout by:
       alter system set distributed_lock_timeout=27 scope=spfile;
then restart the remote DB:
       startup force;
In the remote DB, run:
       update test_redo set name = 'update_remote_0' where id = 1;
and then in the local DB, execute:
   update test_redo set name = 'update_local_1' where id = 1;
   update test_redo@dblinkremote set name = 'update_remote_1' where id = 1;
After 27 seconds (waiting on event "enq: TX - contention"), it returns:
   ORA-02049: timeout: distributed transaction waiting for lock
   ORA-02063: preceding line from DBLINKREMOTE


ORA-02049 occurs when the segment modified by the DML is blocked for longer than distributed_lock_timeout, for example, if the segment is stored in an Oracle bigfile tablespace and pre-allocation takes longer than distributed_lock_timeout (see Blog: Oracle bigfile tablespace pre-allocation and session blocking).

Test Setup


Set up two DBs, one named dblocal, the other dbremote.

Run Test Code 1 on both DBs; additionally run Test Code 2 on dblocal.
Note: adapt the database name, user_id and password first.

---------------- Test Code 1 ----------------

drop table test_redo;

create table test_redo (
  id   int primary key using index (create index ind_p on test_redo (id)),
  name varchar2(300));

insert into test_redo
select level, rpad('abc', 100, 'x') y from dual connect by level <= 10000;

exec dbms_stats.gather_table_stats(null, 'TEST_REDO', cascade => true);

create or replace procedure create_awr as
begin
  sys.dbms_workload_repository.create_snapshot('ALL');
end;
/

create or replace procedure db_commit as
begin
  commit;
end;
/

create or replace procedure db_commit_autotrx(i number) as
  pragma autonomous_transaction;
begin
  update test_redo set name = i||'_autonomous_transaction_at_'||localtimestamp
   where id = (10000 - (i - 1));
  commit;
end;
/

---------------- Test Code 2 ----------------

drop database link dblinkremote;
create database link dblinkremote connect to k identified by k using 'dbremote';

create or replace procedure update_test_tab(p_cnt number, p_job number) as
begin
  for i in 1.. p_cnt loop
    update test_redo set name = rpad('abc', 100, i)
     where id = (p_job - 1) * 1000 + mod(i, 1001);
    ---- get similar AWR Redo figures for select for update
    --for c in (select name from test_redo
    --           where id = (p_job - 1) * 1000 + mod(i, 1001) for update)
    --loop null; end loop;
    commit;
  end loop;
end;
/

create or replace procedure update_test_tab_loop(p_cnt number, p_job_cnt number) as
  l_job_id pls_integer;
begin
  for i in 1.. p_job_cnt loop
    dbms_job.submit(l_job_id, 'update_test_tab('||p_cnt||', '|| i||');');
  end loop;
  commit;
end;
/

create or replace procedure dblinkremote_commit as
begin
  db_commit@dblinkremote;
end;
/

create or replace procedure dblinkremote_commit_autotrx(i number) as
begin
  db_commit_autotrx@dblinkremote(i);
end;
/

---------------- script_1 ----------------

#!/bin/ksh
sqlplus -s testu/testp |&
print -p "exec create_awr;"
print -p "set feedback off"
i=0
while (( i < $1 ))
do
  (( i+=1 ))
  echo $i
  print -p "update test_redo set name = rpad('abc', 100, $i) where id = mod($i, 1001);"
  print -p "commit;"
done
print -p "exec create_awr;"
print -p "exit 0"

---------------- script_2 ----------------

#!/bin/ksh
sqlplus -s testu/testp |&
print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "set feedback off"
i=0
while (( i < $1 ))
do
  (( i+=1 ))
  echo $i
  print -p "update test_redo set name = rpad('abc', 100, $i) where id = mod($i, 1001);"
  print -p "update test_redo@dblinkremote set name = rpad('abc', 100, $i) where id = mod($i, 1001);"
  print -p "commit;"
done
print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "exit 0"

---------------- script_3 ----------------

#!/bin/ksh
sqlplus -s testu/testp |&
print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "set feedback off"
i=0
while (( i < $1 ))
do
  (( i+=1 ))
  echo $i
  print -p "update test_redo set name = rpad('abc', 100, $i) where id = mod($i, 1001);"
  print -p "update test_redo@dblinkremote set name = rpad('abc', 100, $i) where id = mod($i, 1001);"
  print -p "exec dblinkremote_commit_autotrx(mod($i, 1001));"
  print -p "commit;"
done
print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "exit 0"

Thursday, July 2, 2015

UNDO Practice

An expert is a person who has found out by his own painful experience all the mistakes that one can make in a very narrow field.
Niels Bohr

The book Oracle Core: Essential Internals for DBAs and Developers declares that the change vector (the heart of redo and undo) is the most important feature of Oracle (Page 5). While redo data (the after image) is written to be forgotten, undo data (the before image) stays active over the lifetime of the instance. Therefore it is worth doing a couple of exercises on UNDO by following "The Beginners' Guide to Becoming an Oracle Expert" (Page 5).

All tests are done in Oracle 11.2.0.4.0 with undo_management=AUTO and db_block_size=8192.

UNDO dump


The first test code is to practice what one learned from the book (see the book for more details).

Run Test Code 1, output looks like:

------------ testhost_ora_25106_block_1.trc ------------
Start dump data blocks tsn: 1172 file#:863 minblk 2082662 maxblk 2082662
scn: 0x0824.e789f0b4 seq: 0x02 flg: 0x04 tail: 0xf0b40602
0x02   0x0078.01c.000006ba  0x00c0054a.03c8.01  ----    1  fsc 0x0000.00000000
bdba: 0x001fc766


------------ testhost_ora_25106_block_2.trc ------------
Start dump data blocks tsn: 1172 file#:863 minblk 2082663 maxblk 2082663
scn: 0x0824.e789f0b8 seq: 0x02 flg: 0x04 tail: 0xf0b80602
0x02   0x0078.01c.000006ba  0x00c0054b.03c8.01  ----    1  fsc 0x0000.00000000
bdba: 0x001fc767


------------ testhost_ora_25106_block_3.trc ------------
Start dump data blocks tsn: 1172 file#:863 minblk 2082659 maxblk 2082659
scn: 0x0824.e789f0bc seq: 0x02 flg: 0x04 tail: 0xf0bc0602
0x02   0x0078.01c.000006ba  0x00c0054c.03c8.01  ----    1  fsc 0x0000.00000000
bdba: 0x001fc763


------------ testhost_ora_25106_undo_1.trc ------------
Start dump data blocks tsn: 2 file#:3 minblk 1354 maxblk 1354
BH (0xc0f6fe38) file#: 3 rdba: 0x00c0054a (3/1354) class: 256 ba: 0xc013a000
scn: 0x0824.e789f0b4 seq: 0x01 flg: 0x04 tail: 0xf0b40201
xid: 0x0078.01c.000006ba  seq: 0x3c8 cnt: 0x1   irb: 0x1   icl: 0x0   flg: 0x0000
rdba: 0x00c00549
op: C  uba: 0x00c00549.03c8.38


------------ testhost_ora_25106_undo_2.trc ------------
Start dump data blocks tsn: 2 file#:3 minblk 1355 maxblk 1355
BH (0xcef64d80) file#: 3 rdba: 0x00c0054b (3/1355) class: 256 ba: 0xce018000
scn: 0x0824.e789f0b8 seq: 0x02 flg: 0x04 tail: 0xf0b80202
xid: 0x0078.01c.000006ba  seq: 0x3c8 cnt: 0x1   irb: 0x1   icl: 0x0   flg: 0x0000
rdba: 0x00c0054a
op: C  uba: 0x00c0054a.03c8.02


------------ testhost_ora_25106_undo_3.trc ------------
Start dump data blocks tsn: 2 file#:3 minblk 1356 maxblk 1356
BH (0xd2fe5930) file#: 3 rdba: 0x00c0054c (3/1356) class: 256 ba: 0xd2d4c000
scn: 0x0824.e789f0bc seq: 0x02 flg: 0x04 tail: 0xf0bc0202
xid: 0x0078.01c.000006ba  seq: 0x3c8 cnt: 0x1   irb: 0x1   icl: 0x0   flg: 0x0000
rdba: 0x00c0054b
op: C  uba: 0x00c0054b.03c8.02


1. The output above shows that all UNDO blocks are linked through the "rdba" field, from last to first
    (0x00c0054c -> 0x00c0054b -> 0x00c0054a).

2. The modified ITL of each data block is also saved in its UNDO block, in the field marked with
    "op: C  uba:". So all modified information in the data block is recorded in the before image.
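This backward chain can be walked programmatically; a minimal sketch using the rdba values from the dumps above (the final 0x00c00549 target is the rdba printed in the undo_1 dump):

```python
# Each undo block records the rdba of the previous undo block, so the
# chain can be walked from the last block back to the first.
prev_rdba = {
    "0x00c0054c": "0x00c0054b",   # undo_3 -> undo_2
    "0x00c0054b": "0x00c0054a",   # undo_2 -> undo_1
    "0x00c0054a": "0x00c00549",   # undo_1 -> its predecessor
}

block = "0x00c0054c"              # start at the most recent undo block
chain = [block]
while block in prev_rdba:
    block = prev_rdba[block]
    chain.append(block)
print(" -> ".join(chain))
# 0x00c0054c -> 0x00c0054b -> 0x00c0054a -> 0x00c00549
```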

UNDO Size


We can make an UNDO test by updating one block many times, and querying an unmodified block from a different session. The purpose is to try to extend the UNDO length (space) towards infinity.

Open one session and update the first row 1000 times, without commit:

 begin
   for i in 1..1000 loop
     update test_tab set name = rpad('a', 3000, i) where id = '1';
   end loop;
 end;
 /

Open a second session and read the second row:

 select value from v$sysstat where name = 'data blocks consistent reads - undo records applied';
 select count(*) from test_tab t where id = '2';
 select value from v$sysstat where name = 'data blocks consistent reads - undo records applied';

The output:

 7401710432
 1
 7401711432

shows that there are 1000 (7401711432 - 7401710432) "data blocks consistent reads - undo records applied".

Even though the second row is not modified and sits in a different block, the session still needs to apply all the UNDO records of the open transaction to build the consistent read (CR).

(More statistics can be listed by RUNSTATS in Expert Oracle Database Architecture)
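The cost of this CR construction can be modelled roughly: the before images of the uncommitted transaction are applied one by one, so the work grows with the number of uncommitted changes (a toy model with made-up version labels, not Oracle's actual CR mechanism):

```python
def consistent_read(current_value, undo_chain):
    """Toy model of a consistent read: apply undo records (before
    images), newest first, until the whole chain of the uncommitted
    transaction is undone. Returns the reconstructed value and the
    number of undo records applied."""
    value, applied = current_value, 0
    for before_image in undo_chain:
        value = before_image
        applied += 1
    return value, applied

# 1,000 uncommitted updates -> 1,000 undo records to apply
undo = [f"version_{i}" for i in range(999, -1, -1)]   # newest .. oldest
print(consistent_read("version_1000", undo))  # ('version_0', 1000)
```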


UNDO Duration


Another UNDO test continuously inserts rows without commit while querying from a few other sessions. The purpose is to try to extend the UNDO duration (time) towards infinity.

A similar code pattern can probably be found in some real applications
(see Test Code 2).

Start the test by:

 exec insert_no_commit_loop(1);
 exec test_select_loop(24);

Running the following query from time to time, we can see that elapsed_per_exec of the select statement gradually increases. At the beginning it is a few milliseconds; after a few hours it can reach a couple of minutes.

select executions, disk_reads
      ,rows_processed, round(rows_processed/executions, 2) rows_per_exec
      ,buffer_gets, round(buffer_gets/executions, 2) buffer_per_exec
      ,round(elapsed_time/1e3) elapsed_time, round(elapsed_time/1e3/executions, 2) elapsed_per_exec
      ,v.*
from v$sql v where lower(sql_text) like '%test_tab2%' and v.executions > 0
order by v.executions desc;


Deeper into Buffer Cloning


In the section "Deeper into Buffer Cloning" (Page 244) of Oracle Performance Firefighting (4th Printing, June 2010), there is this text:

The SCN is 12320, which is before our query started at time 12330. Therefore, we do not apply the undo. If we did apply the undo, our CR buffer would represent a version of block 7,678 at time 12320, which is too early!
Probably, even though 12320 is before 12330, we should still apply this undo, since it belongs to an open transaction. In fact, we should reverse all blocks of this uncommitted transaction.

Otherwise, the elapsed_per_exec of the above select statement would arrive at a fixed point (a Nash equilibrium, perhaps?).

On Page 242, the text says:

If you recall, when a server process locates a desired buffer and discovers a required row has changed since its query began, it must create a back-in-time image of the buffer.

By the same argument as above, even if a required row was changed before the query began, it should still be reversed if it is not yet committed.


Temporary Table (GTT): UNDO / Redo


As we know, Oracle fulfils a DML in the following 5 steps:
  1. create UNDO Change Vector
  2. create REDO Change Vector
  3. combine both Change Vectors into Redo Buffer, and then write to redo log
  4. write UNDO record into UNDO file
  5. write modified Data into DATA file
(See: Oracle Core: Essential Internals for DBAs and Developers (Page 10); Update Restart and new Active Undo Extent )

In the case of a GTT, applying the principle:
  Oracle redo log never records temporary data.
to prune the above 5 steps, a DML on a GTT is processed in 4 steps (keeping the original step numbers):
  1. create UNDO Change Vector
  3. combine the UNDO Change Vector into Redo Buffer, and then write to redo log
  4. write UNDO record into UNDO file
  5. write modified Data into DATA (TEMP) file
When setting 12c temp_undo_enabled=TRUE, it is further shortened to (remember: TEMP never generates redo; see V$TEMPUNDOSTAT):
  1. create UNDO Change Vector
  4. write UNDO record into TEMP file
  5. write modified Data into DATA (TEMP) file

Test Code 1


drop table test_tab;
create table test_tab (id varchar2(3), scn number, name varchar2(3000), team varchar2(3000));
insert into test_tab values ('1', 1, rpad('a', 3000, 'a'), rpad('x', 3000, 'x'));
insert into test_tab values ('2', 2, rpad('b', 3000, 'b'), rpad('y', 3000, 'y'));
insert into test_tab values ('3', 3, rpad('c', 3000, 'c'), rpad('z', 3000, 'z'));
commit;

set lines 200
column name new_value  hostname
column mysid new_value mysid
column spid new_value  myspid
column dir new_value   mydir

select lower(name) name, sys.dbms_support.mysid mysid from v$database;
select spid from v$session s, v$process p where s.paddr=p.addr and s.sid = sys.dbms_support.mysid;
select '/orabin/app/oracle/admin/'||'&hostname'||'/diag/rdbms/'||'&hostname'||'/'||'&hostname'||'/trace' dir
from dual;

prompt hostname: &hostname, mysid: &mysid, myspid: &myspid, mydir: &mydir

create or replace directory TEST_DUMP_DIR as '&mydir';  
-- see: select * from v$diag_info where name = 'Diag Trace';

drop table dump_text_tab;

create table dump_text_tab (text varchar2(1000)) organization external
(type oracle_loader default directory TEST_DUMP_DIR
 access parameters
  (records delimited by newline CHARACTERSET AL32UTF8
   badfile aaa_db_io:'dump_text_tab.bad'
   logfile aaa_db_io:'dump_text_tab.log'
   discardfile aaa_db_io:'dump_text_tab.dsc'
   fields terminated by ','  OPTIONALLY ENCLOSED BY '"'
   missing field values are null
   reject rows with all null fields
   (text position(1:1000)))
 location ('host_ora_spid_x.trc')) reject limit unlimited;

create or replace procedure dump_read(loc varchar2, type varchar2) as
begin
  dbms_output.put_line('------------ '||loc||' ------------');
  execute immediate q'[alter table dump_text_tab location (']'||loc||q'[')]';
  if type = 'block' then
    for c in (
      select text from dump_text_tab
      where text like 'Start dump data blocks tsn:%'
         or text like 'scn:%'
         or text like '%-    1  fsc%'
         or text like 'bdba:%') loop
      dbms_output.put_line(c.text);
    end loop;
  else
  for c in (
    select text from dump_text_tab
    where text like 'Start dump data blocks tsn:%'
       or text like 'BH (%'
       or text like 'scn:%'
       or text like 'xid:%'
       or text like 'rdba:%'
       or text like 'op: C  uba:%') loop
    dbms_output.put_line(c.text);
  end loop;   
  end if;
end;
/

------------ Block 1 ------------
def id = 1

update test_tab set scn = dbms_flashback.get_system_change_number, name = upper(name), team = upper(team) where id = '&id';
-- dbms_flashback.get_system_change_number call makes test stable.

column file_id new_value  fid
column block_nr new_value blk
column location new_value loc_block_1

select dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'TEST_TAB') file_id
      ,dbms_rowid.rowid_block_number(t.rowid)           block_nr
      ,'&hostname'||'_ora_'||trim('&myspid')||'_block_&id'||'.trc' location
from test_tab t where id = '&id';

prompt fid: &fid, blk: &blk, loc: &loc_block_1

alter session set tracefile_identifier = 'block_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ UNDO 1 ------------
column location new_value loc_undo_1
select ubafil    file_id
      ,ubablk    block_nr
   ,'&hostname'||'_ora_'||trim('&myspid')||'_undo_&id'||'.trc' location
from v$session s, v$transaction x
where s.saddr = x.ses_addr and s.sid = sys.dbms_support.mysid;

alter session set tracefile_identifier = 'undo_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ Block 2 ------------
def id = 2

update test_tab set scn = dbms_flashback.get_system_change_number, name = upper(name), team = upper(team) where id = '&id';

column location new_value loc_block_2
select dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'TEST_TAB') file_id
      ,dbms_rowid.rowid_block_number(t.rowid)           block_nr
      ,'&hostname'||'_ora_'||trim('&myspid')||'_block_&id'||'.trc' location
from test_tab t where id = '&id';

alter session set tracefile_identifier = 'block_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ UNDO 2 ------------
column location new_value loc_undo_2
select ubafil    file_id
      ,ubablk    block_nr
   ,'&hostname'||'_ora_'||trim('&myspid')||'_undo_&id'||'.trc' location
from v$session s, v$transaction x
where s.saddr = x.ses_addr and s.sid = sys.dbms_support.mysid;

alter session set tracefile_identifier = 'undo_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ Block 3 ------------
def id = 3

update test_tab set scn = dbms_flashback.get_system_change_number, name = upper(name), team = upper(team) where id = '&id';

column location new_value loc_block_3
select dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'TEST_TAB') file_id
      ,dbms_rowid.rowid_block_number(t.rowid)           block_nr
      ,'&hostname'||'_ora_'||trim('&myspid')||'_block_&id'||'.trc' location
from test_tab t where id = '&id';

alter session set tracefile_identifier = 'block_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ UNDO 3 ------------
column location new_value loc_undo_3
select ubafil    file_id
      ,ubablk    block_nr
   ,'&hostname'||'_ora_'||trim('&myspid')||'_undo_&id'||'.trc' location
from v$session s, v$transaction x
where s.saddr = x.ses_addr and s.sid = sys.dbms_support.mysid;

alter session set tracefile_identifier = 'undo_&id';
alter system flush buffer_cache;
alter system dump datafile &fid block &blk; 

------------ Block dump read ------------
set serveroutput on
exec dump_read('&loc_block_1', 'block');
exec dump_read('&loc_block_2', 'block');
exec dump_read('&loc_block_3', 'block');

------------ UNDO dump read ------------
exec dump_read('&loc_undo_1', 'undo');
exec dump_read('&loc_undo_2', 'undo');
exec dump_read('&loc_undo_3', 'undo');

Test Code 2


create sequence test_seq;
create table test_tab2 (id number, seq_nr number, cnt number);
create index test_tab2_ind on test_tab2 (id, seq_nr, cnt) compress 1;
create type type_c100 as table of varchar2(100);
/

create or replace procedure insert_no_commit(p_cnt number) as
begin
  insert into test_tab2 select 99, test_seq.nextval, level from dual connect by level <= p_cnt;
  dbms_lock.sleep(0.1);
end;
/

create or replace procedure test_select as
  l_tab   type_c100;
begin
    select rowidtochar(rowid) bulk collect into l_tab
    from   test_tab2 where  id = 99;
    dbms_lock.sleep(0.1);
end;
/

create or replace procedure insert_no_commit_loop(p_job_cnt number)
as
   l_job_id pls_integer;
begin
    for i in 1.. p_job_cnt loop
      dbms_job.submit(l_job_id, 'begin while true loop insert_no_commit(4); end loop; end;');
    end loop;
    commit;
end;   
/

create or replace procedure test_select_loop(p_job_cnt number)
as
   l_job_id pls_integer;
begin
    for i in 1.. p_job_cnt loop
      dbms_job.submit(l_job_id, 'begin while true loop test_select; end loop; end;');
    end loop;
    commit;
end;   
/

Thursday, June 18, 2015

Temporary LOB Memory Allocation and Freeing in JAVA

This blog demonstrates memory usage when creating and freeing a temporary LOB, as well as Oracle event 60025, which triggers cleanup of the temporary segments used by temp LOBs.

Sample code is appended, and tested in Oracle 11.2.0.3.0 and 11.2.0.4.0.

Run the test in 4 different combinations; the output is as follows:
 
java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 no no
 >>> Lob insert...1
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 1 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 12
 ------------------------------------------
 >>> Lob insert...2
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 2 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 22
 ------------------------------------------

java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 no event
 alter session set events '60025 trace name context forever'
 >>> Lob insert...1
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 1 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 12
 ------------------------------------------
 >>> Lob insert...2
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 2 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 22
 ------------------------------------------

java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 free no
 >>> Lob insert...1
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 0 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 12
 ------------------------------------------
 >>> Lob insert...2
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 0 0 0
 tablespace, contents, segtype, size_mb
 TEMP TEMPORARY LOB_INDEX 2
 TEMP TEMPORARY LOB_DATA 12
 ------------------------------------------

java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 free event
 alter session set events '60025 trace name context forever'
 >>> Lob insert...1
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 0 0 0
 tablespace, contents, segtype, size_mb
 ------------------------------------------
 >>> Lob insert...2
 sid, cache_lobs, nocache_lobs, abstract_lobs
 229 0 0 0
 tablespace, contents, segtype, size_mb
 ------------------------------------------


Recap of the above output:
  1. In the first two runs, BLOB freeTemporary() is not called, so neither the locator nor the contents of the temporary blob are freed.
  2. In the third run, freeTemporary() is called but the event is not set: the locator is freed, but the contents of the last temporary blob are not.
  3. In the fourth run, freeTemporary() is called and the event is set as well: the locator is freed, and the contents of the last temporary blob are also freed immediately.
We can see that V$TEMPORARY_LOBS counts the locators of temporary blobs, and V$TEMPSEG_USAGE represents memory usage (in buffer cache blocks).

---------------  TestCode ---------------


import oracle.sql.*;
import java.sql.*;
import java.util.*;
import javax.sql.*;
import oracle.jdbc.*;
/*
DB setup:
  create table test_blob(x blob);
References:
  How to Release Temporary LOB Segments without Closing the JDBC Connection (Doc ID 1384829.1)
  https://community.oracle.com/thread/906568
Compile:
  /usr/java/bin/javac -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest.java
Usage:
  TempLobMemTest <numloops> <freeTemporary_?> <event_60025__?>
Runs:
  /usr/java/bin/java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 no no
  /usr/java/bin/java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 no event
  /usr/java/bin/java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 free no
  /usr/java/bin/java -cp .:$ORACLE_HOME/jdbc/lib/ojdbc6.jar TempLobMemTest 2 free event
Oracle V$ performance views:
  freeTemporary_?: free pointer, but not memory.        see in v$temporary_lobs
  event_60025__?:  free memory.                         see v$tempseg_usage
*/
public class TempLobMemTest
{
  private static String URL = "jdbc:oracle:thin:user/pwd@testhost:1522:testdb";
  private static int numLoop;
  private static String freeTemporary;
  private static String event_60025;
  private static String SQL_LOB =
   "select sid, cache_lobs, nocache_lobs, abstract_lobs " +
    "from v$temporary_lobs where sid=sys.dbms_support.mysid";
  private static String SQL_MEM =
   "select t.tablespace, t.contents, t.segtype, " +
           "round (((t.blocks * 8192) / 1024 / 1024), 2) size_mb " +
    "from v$tempseg_usage t, v$session s " +
    "where s.saddr=t.session_addr and s.sid=sys.dbms_support.mysid";
  private static String SQL_EVENT =
    "alter session set events '60025 trace name context forever'";
 
  private static void printMemInfo(Connection conn) throws SQLException {
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery(SQL_LOB);
      System.out.println("sid, cache_lobs, nocache_lobs, abstract_lobs");
      while(rs.next()) {
        System.out.println(
          rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3) + " " + rs.getString(4));
      }
     
      rs = stmt.executeQuery(SQL_MEM);
      System.out.println("tablespace, contents, segtype, size_mb");
      while(rs.next()) {
        System.out.println(
          rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3) + " " + rs.getString(4));
      }
      System.out.println("------------------------------------------");
  }
 
  private static void insertBlob(Connection conn) throws SQLException {
      BLOB b = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
      byte[] val = new byte[1024*1024*10];
      Arrays.fill(val, (byte) 1);
      b.setBytes(1, val);
      OraclePreparedStatement pstmt = (OraclePreparedStatement) conn
          .prepareStatement("insert into test_blob values(?)");
      pstmt.setBLOB(1, b);
      pstmt.execute();
     
      if (freeTemporary.equals("free"))
         b.freeTemporary(); // free pointer, no more seen in v$temporary_lobs; but memory not freed.
         //b.free();        // same as freeTemporary()
      conn.commit();
  }
 
  public static void main(String[] args) throws Exception {
      Class.forName("oracle.jdbc.driver.OracleDriver");
      Connection conn = DriverManager.getConnection(URL);
      numLoop = Integer.parseInt(args[0]);
      freeTemporary = args[1];
      event_60025 = args[2];
      Statement stmt = conn.createStatement();
     
      if (event_60025.equals("event")) {
        System.out.println(SQL_EVENT);
        stmt.executeUpdate(SQL_EVENT);
      }
     
      for(int i = 0; i < numLoop; ++i) {
        System.out.println(">>> Lob insert..." + (i+1));
        insertBlob(conn);
        printMemInfo(conn);
      }
  }
}

Wednesday, April 29, 2015

IBM AIX POWER7 CPU Usage and Throughput

This blog summarizes the author's observations and understanding of CPU usage on the AIX POWER7 SMT4 architecture.
POWER8 content is added in Section "8 Power8".

1. POWER7 Execution Units


 A POWER7 core is made up of 12 execution units [1]:

   2 fixed-point units
   2 load/store units
   4 double-precision floating-point units
   1 vector unit supporting VSX
   1 decimal floating-point unit
   1 branch unit
   1 condition register unit

2. CPU Usage and Throughput


When SMT=4 is set, each core provides 4 Hardware Thread Contexts (HTC, logical CPU) and can simultaneously execute 4 software threads (processes, tasks).

For example, if more than 2 threads want to execute floating-point instructions in the same cycle, the 3rd and 4th threads have to wait for one of the two FP units to become free. Therefore, with SMT=4, the number of instructions executed by a single HTC goes down, but the overall throughput per core goes up. IBM claims a 60% boost of throughput: when 4 processes run on a core (smt=4), it delivers 1.6 times the throughput of a single process per core ([2], [3]). In the case of smt=2, the boost is 40%, or 1.4 times the throughput (note that we use the lower-case "smt=4" to distinguish it from the POWER SMT configuration SMT=4).

Mathematically, with smt=4, one could say that 25% core usage provides 40% CPU power. The response time increases from 1 to 2.5 (= 1/0.4) instead of 4.

Now comes the puzzle: how much CPU usage should be reported for each HTC and each process? 25% or 40% in the above example? In general, measuring and modelling SMT CPU usage is an ongoing research subject ([5]).

POWER7 comes with a new model of CPU usage. The general intent of POWER7 is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput [2].

With smt=4, 4 threads run per core; each thread gets a 25% share of the whole core and provides 40% of the smt=1 throughput. To build up the linear relation of throughput to CPU usage, the CPU usage for smt=1, 2, 3, 4 can be computed as:

 CPU%(smt=1) = (1.0/0.4) * 25% = 62.50%

 CPU%(smt=2) = (0.7/0.4) * 25% = 43.75%

 CPU%(smt=3) = (0.5/0.4) * 25% = 31.25%

  CPU%(smt=4) = (0.4/0.4) * 25% = 25.00%

Note that for smt=3, the boost of 50%, or 1.5 times throughput, stems from this blog's own test and may be inaccurate.

Expressed as a linear equation, it looks like:

  t = f(s) * u

where t is throughput, s is smt (with possible values 1, 2, 3, 4), and u is CPU usage.

Putting all together, we can draw a table:

smt  Throughput/core  Throughput/HTC  CPU%
1    1.0              1.0             62.50
2    1.4              0.7             43.75
3    1.5              0.5             31.25
4    1.6              0.4             25.00

Table-1
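The Table-1 values follow directly from the boost factors quoted above; a minimal Python sketch of the computation (the factors 1.0/1.4/1.5/1.6 are the ones stated in this blog, with the smt=3 value from its own test):

```python
# CPU% model of Table-1: normalize Throughput/HTC so that smt=4 maps to 25%.
boost = {1: 1.0, 2: 1.4, 3: 1.5, 4: 1.6}   # throughput per core vs. smt=1

def cpu_pct(smt):
    per_htc = boost[smt] / smt             # Throughput/HTC column
    return round(per_htc / 0.4 * 25.0, 2)  # linear mapping, smt=4 -> 25.00

for s in sorted(boost):
    print(s, boost[s], round(boost[s] / s, 2), cpu_pct(s))
# smt=1 -> 62.5, smt=2 -> 43.75, smt=3 -> 31.25, smt=4 -> 25.0
```

The division by 0.4 simply anchors the scale so that the fully loaded smt=4 thread reads exactly 25%.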

Therefore, the maximum CPU usage of an HTC (logical CPU) and of a software thread (process or task) is 62.5%. In POWER7 SMT=4, it would be astonishing to observe a process CPU usage of more than 65%, or an HTC CPU usage of more than 65% (mpstat -s).

Picking performance test data out of Blog [6] Table-1 (tested on POWER7, 4 cores, SMT=4, Oracle 11.2.0.3.0), and verifying it against the above linear relations:

JOB_CNT  HTC/Core  C2_RUN_CNT  Throughput/HTC  Throughput_Based_CPU%  Throughput_Ratio_to_Min  Theory_Throughput/HTC
1        1         119         119 (119/1)     64.67                  2.59 (119/46)            115.00
8        2         580         73 (580/8)      39.40                  1.58 (73/46)             80.50
12       3         654         55 (654/12)     29.89                  1.20 (55/46)             57.50
16       4         730         46 (730/16)     25.00                  1.00 (46/46)             46.00

Table-2

where Throughput_Based_CPU%:

   (119/46)*25% = 64.67%
   (73/46)*25%  = 39.40%
   (55/46)*25%  = 29.89%

and Theory_Throughput/HTC based on linear interpolation:

   46*(0.6250/0.25) = 115.00
   46*(0.4375/0.25) = 80.50
   46*(0.3125/0.25) = 57.50
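Both derived columns can be recomputed from the raw run counts. A small Python sketch follows; note that Table-2 rounds Throughput/HTC to whole numbers (e.g. 580/8 = 72.5 shown as 73) and the baseline to 46, so the unrounded values printed here differ slightly in the last digits:

```python
# Recompute Table-2 from raw counts: C2_RUN_CNT / JOB_CNT gives Throughput/HTC;
# scaling by the smt=4 row maps it onto the Table-1 CPU% model.
runs = {1: 119, 8: 580, 12: 654, 16: 730}                # JOB_CNT -> C2_RUN_CNT
model_cpu = {1: 62.50, 8: 43.75, 12: 31.25, 16: 25.00}   # Table-1 CPU% per row

base = runs[16] / 16                       # smt=4 baseline Throughput/HTC (45.625)
for job, cnt in runs.items():
    per_htc = cnt / job
    tput_cpu = per_htc / base * 25.0       # Throughput_Based_CPU%
    theory = base * model_cpu[job] / 25.0  # Theory_Throughput/HTC
    print(job, round(per_htc, 1), round(tput_cpu, 2), round(theory, 2))
```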

Table-2 shows that Theory_Throughput is close to the tested throughput. Thus the designed CPU usage is a calibrated, scalable metric (notice: the "S" in POWER8 server names signifies Scale-out).

Applications with a lot of transactions are usually benchmarked in terms of throughput, hence binding throughput linearly to CPU usage is a practical approach to assess application performance.

In principle, CPU usage represents the throughput, and its complement (1 - usage) stands for the remaining available capacity. One process running on one core with 62.5% CPU usage on the first HTC means there is still 37.5% available capacity on the other 3 HTCs, each of which can take a 12.5% share.

In practice, CPU utilization can be applied as a metric for charging back computing resources used, and its complement can be used for prediction in capacity planning.

This model of SMT CPU accounting is not widely acknowledged and has therefore caused confusion. For example, Oracle Note 1963791.1:

 Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1) [8]

where session trace records:
 cpu time = 86.86   waited time = 7.06  elapsed time = 142.64

and the difference:
 142.64 - (86.86 + 7.06) = 48.72 seconds,
is interpreted as "Unaccounted Time".

In fact,
   86.86/142.64 = 60.90%,
which indicates that a single Oracle session alone occupies almost one full core.
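The arithmetic, restated as a quick Python check:

```python
# The Doc ID 1963791.1 numbers re-read through the PURR model: the
# CPU/elapsed ratio lands close to the 62.5% single-HTC cap of Table-1.
cpu_time, waited, elapsed = 86.86, 7.06, 142.64

unaccounted = elapsed - (cpu_time + waited)
ratio = cpu_time / elapsed

print(round(unaccounted, 2))   # 48.72
print(round(ratio, 4))         # 0.6089, i.e. ~60.9%
```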

Blog [9] also reported a similar observation on AIX POWER7 and tried to explain the unaccounted time.

People working on other UNIX systems (Solaris, HP-UX, Linux) are probably used to the intuitive interpretation of CPU time and elapsed time, but with the advance of multi-threaded processors as in AIX, a rethink would help disperse the confusion so that CPU resources can be efficiently allocated and accurately assessed.

3. POWER PURR


According to [2][4],

 POWER5 includes a per-thread processor utilization of resources register (PURR), which increments at the timebase frequency multiplied by the fraction of cycles on which the thread can dispatch instructions.

 Beginning with IBM® POWER5 TM processor architecture, a new register, PURR, is introduced to assist in computing the utilization.
 The PURR stands for Processor Utilization of Resources Register, and it is available per Hardware Thread Context.

 The PURR counts in proportion to the real time clock (timebase)
 The SPURR stands for Scaled Processor Utilization of Resources Register.
 The SPURR is similar to PURR except that it increments proportionally to the processor core frequency.
 The AIX® lparstat, sar & mpstat utilities are modified to report the PURR-SPURR ratio via a new column, named "nsp".

and it demonstrates the enhanced commands: time (timex), sar -P ALL, mpstat -s, lparstat -E

AIX provides the pprof command with the flag -r PURR to report CPU usage in PURR time instead of timebase time.

For example, start one CPU intensive Oracle session (process) in one core for a duration of 120 seconds:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 1, p_dur_seconds => 120);
 (see TestCase in Blog [7], POWER7, 4 Core, SMT=4)

In this case, a single core runs in smt=1. Track its PURR time for 100 seconds by:
 pprof 100 -r PURR

and display the report by:
 head -n 50 pprof.cpu

The output shows (irrelevant lines removed):

                    Pprof CPU Report
        E = Exec'd      F = Forked
        X = Exited      A = Alive (when traced started or stopped)
        C = Thread Created
        * = Purr based values
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 42598406  7864540  AA 21299317        0    62.930     0.037    99.805    99.768

If tracking with TimeBase by:
 pprof 100

The output (head -n 50 pprof.cpu) looks like:

               Pname      PID     PPID  BE      TID     PTID  ACC_time  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    =====  ========  ========  ========  ========
     ora_j000_testdb  1835064        0  AA  2687059        0    99.899     0.016    99.916    99.900

Continue our example by starting 8 CPU intensive Oracle sessions (each core runs in smt=2):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 8, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 17760298  7864540  AA 57475195        0    42.910     0.340    99.210    98.870
    
then starting 12 CPU intensive Oracle sessions (each Core runs in smt=3):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 12, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:
   
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 33095898  7864540  AA 50135123        0    30.658     0.017   100.008    99.990

And finally starting 16 CPU intensive Oracle sessions (each Core runs in smt=4):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 16, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j014_testdb 33488964  7864540  AA 73531621        0    24.673     0.143    99.145    99.002

We can see that ACC_time* correlates well with the CPU% of Table-1. The small difference is probably due to a single point of contention on the Oracle latch: row cache objects - child: dc_users [7].

The study [4] shows that PURR limits its inaccuracy to 26% in the single-core POWER5 configuration (CMP+SMT).

Additionally, in AIX, the number of physical processors consumed is reported by sar.physc and vmstat.pc, and the percentage of entitled capacity consumed by sar.%entc and vmstat.ec.

By the way, Linux on POWER machines reads PURR at regular intervals and makes the values available through a file in procfs [4].


3.1 PURR APIs


AIX perfstat_cpu APIs include:
  perfstat_cpu_total(), perfstat_partition_total(), perfstat_cpu_total_wpar(), and perfstat_cpu_util().
perfstat_partition_total Subroutine retrieves global Micro-Partitioning® usage statistics into:
  perfstat_partition_total_t lparstats;  
In libperfstat.h File[12], perfstat_partition_total_t contains the following PURR members:

  purr_counter            Number of purr cycles spent in user and kernel mode.
  u_longlong_t puser  Raw number of physical processor ticks in user mode.
  u_longlong_t psys   Raw number of physical processor ticks in system mode.
  u_longlong_t pidle  Raw number of physical processor ticks idle.
  u_longlong_t pwait  Raw number of physical processor ticks waiting for I/O.    
perfstat_cpu_total Subroutine fills perfstat_cpu_total_t, which contains the following members:

  u_longlong_t user  Raw total number of clock ticks spent in user mode.
  u_longlong_t sys   Raw total number of clock ticks spent in system mode.
  u_longlong_t idle  Raw total number of clock ticks spent idle.
  u_longlong_t wait  Raw total number of clock ticks spent waiting for I/O.
(see File: libperfstat.h)

perfstat_partition_total Interface demonstrates emulating the lparstat command including PCPU:

  /* calculate physical processor ticks during the last interval in user, system, idle and wait mode */
  delta_pcpu_user  = lparstats.puser - last_pcpu_user; 
  delta_pcpu_sys   = lparstats.psys  - last_pcpu_sys;
  delta_pcpu_idle  = lparstats.pidle - last_pcpu_idle;
  delta_pcpu_wait  = lparstats.pwait - last_pcpu_wait; 
Oracle's link to AIX perfstat_cpu_total is visible via:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep perfstat_cpu_total

Name                 Type  Value       Size
-------------------  ----  ----------  ----
.perfstat_cpu_total  T     4580969124
.perfstat_cpu_total  t     4580969124   44
perfstat_cpu_total   U              -
perfstat_cpu_total   d     4861786384    8

 Type:
   T Global text symbol
   t Local text symbol
   U Undefined symbol
   d Local data symbol
but it is not clear how Oracle calls perfstat_partition_total.

4. Thread Scheduling and Throughput


Picking performance test data out of Blog [6] Table-3, and adding the smt per core, we get:

JOB_CNT  C1_RUN_CNT(Throughput)  MIN  MAX  Core_1_smt  Core_2_smt  Core_3_smt  Core_4_smt  Theory_Throughput
1        118                     118  118  1           0           0           0           118.00
2        240                     120  120  1           1           0           0           236.00
3        360                     120  120  1           1           1           0           354.00
4        461                     109  120  1           1           1           1           472.00
5        476                     74   119  2           1           1           1           519.20
6        515                     75   97   2           2           1           1           566.40
7        551                     66   100  2           2           2           1           613.60
8        569                     63   77   2           2           2           2           660.80
9        597                     58   76   3           2           2           2           672.60
10       601                     56   73   3           3           2           2           684.40
11       613                     48   67   3           3           3           2           696.20
12       646                     49   64   3           3           3           3           708.00
13       683                     47   66   4           3           3           3           719.80
14       696                     46   65   4           4           3           3           731.60
15       714                     45   51   4           4           4           3           731.60
16       733                     44   47   4           4           4           4           755.20

Table-3

where Theory_Throughput is calculated from Table-1 above and the smt per core, for example:

 JOB_CNT=5,  118*1.4*1+118*3     = 519.2
 JOB_CNT=11, 118*1.5*3+118*1.4*1 = 696.2
 JOB_CNT=15, 118*1.6*2+118*1.5*2 = 731.6
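The calculation can be sketched in Python, assuming the per-core smt levels of Table-3 (base rate 118 at smt=1, boost factors from Table-1):

```python
# Theory_Throughput of Table-3: sum the smt=1 rate (118) scaled by the
# boost factor of each core's smt level (0 marks an unused core).
boost = {0: 0.0, 1: 1.0, 2: 1.4, 3: 1.5, 4: 1.6}

def theory_throughput(core_smt, base=118):
    return sum(base * boost[s] for s in core_smt)

print(round(theory_throughput([2, 1, 1, 1]), 1))   # JOB_CNT=5  -> 519.2
print(round(theory_throughput([3, 3, 3, 2]), 1))   # JOB_CNT=11 -> 696.2
```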

Blog [3] reveals the particularity of the non-existence of an smt=3 mode. It says: when starting 9 processes (Oracle sessions) on a POWER7 with 4 cores and SMT=4, there will be 1 core running with 4 HTCs, 2 cores with 2 HTCs, and 1 core with only 1 HTC; in total, 9 HTCs for 9 Oracle sessions.

As we tested by running:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 9, p_dur_seconds => 120);
 
"pprof 100 -r PURR" shows:

            Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
            =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
  ora_j000_testdb 43646996  7864540  AA 65273961        0    43.578     0.008    97.823    97.815
  ora_j001_testdb 14090250  7864540  AA 45744377        0    43.314     1.621   100.010    98.389
  ora_j002_testdb 38338754  7864540  AA 28442745        0    40.696     0.007   100.000    99.993
  ora_j003_testdb 33095926  7864540  AA 78119153        0    36.575     0.010    99.922    99.912
  ora_j004_testdb 39583756  7864540  AA 35258545        0    36.204     2.242    97.824    95.582
  ora_j005_testdb 42401958  7864540  AA 73662611        0    36.020     1.131    99.731    98.600
  ora_j006_testdb 12976182  7864540  AA 68681805        0    35.944     2.212    99.912    97.700
  ora_j007_testdb 49086646  7864540  AA 75563151        0    32.372     1.893    97.823    95.930
  ora_j008_testdb  7602206  7864540  AA 32112699        0    31.676     1.893    97.823    95.930


"mpstat -ws" shows:

              Proc0                           Proc4            
              94.56%                          99.92%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  22.43%  24.13%  36.16%  11.84%  30.39%  20.72%  30.08%  18.72%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.72%  24.80%  16.14%  33.35%  48.67%  46.49%   2.35%   2.49%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
    0      47       6       0      47    0.20     5.1
    1      60       0       0      40    0.24     5.9
    2      92       0       0       8    0.38     9.4
    3      14       0       0      86    0.13     3.2
    4      99       0       0       1    0.32     8.1
    5      87       0       0      13    0.19     4.7
    6      89       0       0      11    0.30     7.6
    7      73       0       0      27    0.19     4.6
    8     100       0       0       0    0.26     6.4
    9     100       0       0       0    0.25     6.2
    10     43       0       0      57    0.15     3.8
    11    100       0       0       0    0.34     8.6
    12    100       0       0       0    0.48    12.0
    13    100       0       0       0    0.48    12.0
    14      0       0       0     100    0.02     0.5
    15     11       0       0      89    0.02     0.5
    U       -       -       0       2    0.05     1.3
    -      84       0       0      16    3.95    98.7


However, the above outputs show that 2 HTCs (cpu14, cpu15) are almost idle, one (cpu3) has a low workload, and the other 13 HTCs are more or less busy; probably only the top 9 busy HTCs belong to the 9 running Oracle sessions.

The two idle HTCs (cpu14, cpu15) in Core 4 could also signify that 3 cores run in smt=4 and one core in smt=2.

Applying this to 15 processes (Oracle sessions), there would be 3 cores running with 4 HTCs and one core with only 2 HTCs; in total, 14 HTCs for 15 Oracle sessions.

Let's test it by:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 15, p_dur_seconds => 120);

"pprof 100 -r PURR" shows:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 40108192  7864540  AA 15859877        0    27.389     0.020   100.003    99.983
     ora_j001_testdb 11927686  7864540  AA 40697939        0    27.277     0.023   100.020    99.997
     ora_j002_testdb 43647120  7864540  AA 30867661        0    26.892     0.657   100.003    99.346
     ora_j003_testdb 17760278  7864540  AA 73531567        0    26.695     0.009    99.040    99.031
     ora_j004_testdb  8388634  7864540  AA 60424235        0    26.576     0.003   100.013   100.011
     ora_j005_testdb 33095868  7864540  AA 30801969        0    25.437     0.657    98.956    98.300
     ora_j006_testdb 37290002  7864540  AA 26411087        0    25.278     0.656    97.135    96.478
     ora_j007_testdb 44105940  7864540  AA 39190655        0    24.977     0.003   100.009   100.006
     ora_j008_testdb 45285388  7864540  AA 72482871        0    24.773     0.735   100.013    99.279
     ora_j009_testdb 33488948  7864540  AA 75628593        0    24.262     0.015    96.875    96.860
     ora_j010_testdb 13828318  7864540  AA 67240145        0    24.256     0.016    96.876    96.860
     ora_j011_testdb 42926108  7864540  AA 64421997        0    24.233     0.024    96.874    96.850
     ora_j012_testdb 42336484  7864540  AA 32112833        0    24.180     0.016    96.876    96.860
     ora_j013_testdb 38600812  7864540  AA 47972531        0    24.049     0.734    96.874    96.140
     ora_j014_testdb 46727262  7864540  AA 25690295        0    24.047     0.734    96.874    96.140

"mpstat -s" shows:

              Proc0                           Proc4            
              99.99%                         100.00%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  21.33%  28.15%  24.69%  25.81%  25.00%  25.03%  24.98%  24.99%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.03%  25.02%  24.66%  25.29%  25.92%  24.12%  24.95%  25.02%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
   0       72       5       0      23    0.21     5.3
   1       94       0       0       6    0.28     7.1
   2       91       0       0       9    0.24     6.1
   3       95       0       0       5    0.26     6.5
   4       99       0       0       1    0.25     6.2
   5      100       0       0       0    0.25     6.3
   6      100       0       0       0    0.25     6.3
   7      100       0       0       0    0.25     6.2
   8      100       0       0       0    0.25     6.3
   9      100       0       0       0    0.25     6.3
   10      99       0       0       1    0.25     6.2
   11     100       0       0       0    0.25     6.3
   12     100       0       0       0    0.25     6.4
   13      98       0       0       2    0.25     6.1
   14     100       0       0       0    0.25     6.2
   15     100       0       0       0    0.25     6.3
   -       97       0       0       3    4.00   100.0


The above outputs deliver no evidence of the non-existence of an smt=3 mode. It is possible that I missed some points here. It would be interesting to see how to demonstrate it.

5. AIX Internal Code


AIX "struct procinfo" used in getprocs Subroutine (/usr/include/procinfo.h) contains a comment on pi_cpu:

  struct  procinfo
  {
    /* scheduler information */
    unsigned long   pi_pri;         /* priority, 0 high, 31 low */
    unsigned long   pi_nice;        /* nice for priority, 0 to 39 */
    unsigned long   pi_cpu;         /* processor usage, 0 to 80 */ 
Probably in the minds of the AIX developers, a process's processor usage is not allowed to exceed 80.

In fact, on one POWER8 with SMT=2, a maximum CPU utilization of 76% was observed.

All fields in Oracle procsinfo (not AIX struct "procinfo") can be listed by:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep -i procsinfo

  procsinfo:T153=s1568
  pi_pid:121,0,32;
  pi_ppid:121,32,32;
  ...
  pi_cpu:123,448,32;
  ...
  pi_utime:123,864,32;
  ...
  pi_stime:123,928,32;
  ...

6. vpm_throughput_mode


This AIX scheduler tunable parameter specifies the desired level of SMT exploitation for scaled throughput mode.

A value of 0 gives the default behavior (raw throughput mode).
A value of 1, 2, or 4 selects the scaled throughput mode and the desired level of SMT exploitation, i.e. the number of threads used on one core before using the next core.

schedo -p -o vpm_throughput_mode=
  0 Legacy Raw mode (default)
  1 Enhanced Raw mode with a higher threshold than legacy
  2 Scaled mode, use primary and secondary SMT threads
  4 Scaled mode, use all four SMT threads


Raw Mode (0, 1)

provides the highest per-thread throughput and best response times at the expense of activating more physical cores. For example, Legacy Raw mode (the default) dispatches workload to all primary threads before using any secondary threads.

Secondary threads are activated when the load on all primary threads exceeds a certain utilization, probably 50%, and a new workload (process) arrives to be dispatched.

The 3rd and 4th threads are activated when the load on the secondary threads exceeds a certain utilization, probably 20%, and a new workload (process) arrives to be dispatched.

Scaled Mode (2, 4)

aims for the highest per-core throughput (in the specified mode: 2 or 4) at the expense of per-thread response times and throughput. For example, Scaled mode 2 dispatches workload to both the primary and secondary threads of one core before using those of the next core. Scaled mode 4 dispatches workload to all 4 threads of one core before using those of the next core.

In Scaled mode 2, the 1st and 2nd threads of each core are bound together, so both carry a similar workload (CPU usage). The 3rd and 4th threads are activated when the load on the 1st and 2nd threads exceeds a certain utilization, probably 30%, and a new workload (process) arrives to be dispatched.

Note that this tuning intention is per active core, not for all cores in the LPAR. In fact, it is aimed at activating fewer cores. It would seem to be a setting conceived for a test system with a few LPARs.

Referring to Table-1, vpm_throughput_mode = 2 corresponds to smt = 2: two threads run per core, Throughput/HTC = 0.7, CPU% = 43.75. In real applications with Scaled mode 2, we also observed that CPU% is constrained under 43% even if the run queue is shorter than the number of cores. That means that even when the workload is low, CPU% cannot climb to its maximum of 62.50, and applications cannot benefit from the maximum Throughput/HTC. For performance-critical applications, Scaled mode is questionable. By contrast, Raw mode automatically tunes CPU% based on the workload. That is probably why vpm_throughput_mode defaults to 0.

We can see there is no vpm_throughput_mode=3. This is probably related to the particularity mentioned in Blog [3] about the non-existence of an smt=3 mode.

There is also a naming confusion. By default, POWER7 runs in "Legacy Raw mode", while POWER6 behaves like "scaled throughput mode". Normally "Legacy" means something used in a previous model or release, but here POWER6 uses something like Scaled mode, and the later model (POWER7) introduced a "Legacy" mode 0.

7. NMON Report


The NMON report contains three kinds of CPU-usage worksheets: PCPU_ALL (PCPUnnn), SCPU_ALL (SCPUnnn), and CPU_ALL (CPUnnn).

AIXpert Blog[10] said:

   If I had to guess then the Utilisation numbers in our PCPU_ALL graph (above) have been scaled from 75 cores to roughly 62 cores so "show" some SMT threads are unused so the CPU cores are not fully used (and given enough threads it could give you more performance). Roughly 10 - 15% more. Now, in my humble opinion, this is totally the wrong way of doing this as it is just plain confusing.

   The PCPU and SCPU stats where (in my humble opinion) a confusing mistake and only useful if you have the CPUs in Power saving mode i.e. its changing the CPU GHz to save electrical power.

and IBM developerWorks Forum[11] described:

   PCPU_ALL is the actual physical resource consumption. It would be in units of cores.
   SCPU_ALL is the scaled physical resource consumption. Differs from PCPU_ALL if running at non-nominal frequency.
          Again in units of cores. SCPU, PCPU do not differ when the system runs in the nominal frequency.
   CPU_ALL: PhysicalCPU tag (0.376) denotes the fraction of core used by this partition.
          The distribution of the 0.376 across various modes (user, sys, wait, idle) is proportional to the CPU_ALL% in all modes. 
         Applying this % would give the PCPU_ALL.

In short, PCPU_ALL represents PURR and SCPU_ALL represents SPURR; CPU_ALL gives the PCPU_ALL modes (user, sys, wait, idle) as percentages, whose sum should be around 100%.
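The relation between CPU_ALL and PCPU_ALL described in [11] can be sketched as follows. Only the PhysicalCPU value 0.376 comes from the forum quote; the mode percentages below are hypothetical example values.

```python
# Sketch: distribute the CPU_ALL PhysicalCPU value across modes in proportion
# to the CPU_ALL percentages to recover the PCPU_ALL figures, as described in [11].

physical_cpu = 0.376  # fraction of cores used by this partition (CPU_ALL PhysicalCPU tag)

# Hypothetical CPU_ALL mode percentages (assumed for illustration); sum ~ 100%.
cpu_all_pct = {"user": 60.0, "sys": 20.0, "wait": 5.0, "idle": 15.0}

# Applying each mode's percentage to PhysicalCPU gives PCPU_ALL, in units of cores.
pcpu_all = {mode: physical_cpu * pct / 100.0 for mode, pct in cpu_all_pct.items()}

for mode, cores in pcpu_all.items():
    print(f"PCPU_ALL {mode}: {cores:.4f} cores")

# The mode values sum back to PhysicalCPU.
assert abs(sum(pcpu_all.values()) - physical_cpu) < 1e-9
```

For instance, the user mode above yields 0.376 * 0.60 = 0.2256 cores.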

PCPUnnn represents the CPU% of one single HTC (logical CPU, see Table-1); PCPU_ALL is the sum of all PCPUnnn across the various modes (user, sys, wait, idle).

In case of smt=2 (only two HTCs per core are activated), at each time instant the sum of user and sys in PCPUnnn should be under 43.75%; for each core, the sum of user and sys should be under 87.50% (2 * 43.75); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 87.50%.

In case of smt=4 (all 4 HTCs per core are activated), at each time instant the sum of user and sys in PCPUnnn should be under 25.00%; for each core, the sum of user and sys should be under 100.00% (4 * 25.00); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 100.00%.
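These upper bounds can be computed mechanically from the per-HTC CPU% figures in Table-1 (smt=2 -> 43.75%, smt=4 -> 25.00%). A minimal sketch, where `n_cores` is an assumed example value:

```python
# Per-HTC maximum of user+sys (%) per smt mode, taken from Table-1 in the text.
PER_HTC_MAX = {2: 43.75, 4: 25.00}

def usage_caps(smt: int, n_cores: int):
    """Return the user+sys upper bound (%) at HTC, core and LPAR level."""
    htc_cap = PER_HTC_MAX[smt]
    core_cap = smt * htc_cap       # e.g. 2 * 43.75 = 87.50 for smt=2
    lpar_cap = n_cores * core_cap  # PCPU_ALL upper bound for the whole LPAR
    return htc_cap, core_cap, lpar_cap

print(usage_caps(2, 8))  # (43.75, 87.5, 700.0)
print(usage_caps(4, 8))  # (25.0, 100.0, 800.0)
```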

In the TOP worksheet, %CPU, %Usr and %Sys also represent PURR. Note that if Threads > 1, they are the sum over all threads aggregated by PID, hence can exceed 80%.

In the Oracle AWR report, %CPU is in PURR too.

8. Power8


(a). SMT=4, for single HTC (smt=1), CPU% = 60.00% instead of 62.50% in Power7 (see Table-1).
       The Throughput Ratio of smt=1 vs. smt=4 is 60.00/25.00 = 2.4 instead of 2.5 in Power7,
       that means about 4% (2.5/62.5 = 0.1/2.5 = 0.04) less than Power7.
  
(b). SMT=8, for single HTC (smt=1), CPU% = 56.00%
       The Throughput Ratio of smt=1 vs. smt=8 is 56.00/12.50 = 4.48.

The above preliminary figures need further verification.
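The ratios quoted in (a) and (b) follow directly from the per-HTC CPU% figures; a short sketch reproducing the arithmetic, using only numbers from the text:

```python
# Single-thread vs full-SMT throughput ratios, from the CPU% figures above.
p7_ratio = 62.50 / 25.00        # POWER7, smt=1 vs smt=4
p8_smt4_ratio = 60.00 / 25.00   # POWER8 SMT=4, smt=1 vs smt=4
p8_smt8_ratio = 56.00 / 12.50   # POWER8 SMT=8, smt=1 vs smt=8

# Relative loss of single-thread headroom, POWER8 (SMT=4) vs POWER7:
loss = (62.50 - 60.00) / 62.50  # = 2.5/62.5 = 0.04, i.e. about 4%

print(p7_ratio, p8_smt4_ratio, p8_smt8_ratio, round(loss, 2))
```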

Each POWER8 core has 16 execution pipelines [13]:
 2  fixed-point pipelines
 2  load-store pipelines
 2* load pipelines (no results to store)
 4* double-precision floating-point pipelines, which can also act as eight single-precision pipelines
 2* fully symmetric vector pipelines with support for VMX and VSX AltiVec instructions.
 1  decimal floating-point pipeline
 1* cryptographic pipeline (AES, Galois Counter Mode, SHA-2)
 1  branch execution pipeline
 1  condition register logical pipeline
 
  Note: All units that differ from POWER7 are marked with "*".
        The POWER7 core has 12 execution units; POWER8 has 16.

9. Conclusion


This Blog presented the POWER7 model of CPU usage and throughput, and examined it against real cases. Accurate modelling leads not only to effective system tuning and trustworthy performance assessment, but also to fair charge-back and economical resource utilization (e.g. Power Saver Mode). As a follow-up study, we will investigate the applicability, and eventually the adaptation, of the model to the new POWER8 (SMT=8) and the future POWER9.

References


1. POWER7

2. Understanding CPU Utilization on AIX

3. Local, Near & Far Memory part 3 - Scheduling processes to SMT & Virtual Processors

4. P. Mackerras, T. S. Mathews, and R. C. Swanberg. Operating System Exploitation of the POWER5 System.
    IBM J. Res. Dev., 49(4/5):533–539, July 2005.

5. CPU accounting in multi-threaded processors

6. java stored procedure calls and latch: row cache objects, and performance

7. java stored procedure calls and latch: row cache objects

8. Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1)

    Bug 13354348 : UNACCOUNTED GAP BETWEEN ELAPSED TO CPU TIME ON 11.2 IN AIX
    Bug 16044824 - UNACCOUNTED GAP BETWEEN ELAPSED AND CPU TIME FOR DB 11.2 ON PLATFORM AIX POWER7
    Bug 18599013 : NEED TO CALCULATE THE UNACCOUNTED TIME FOR A TRACE FILE
    Bug 7410881 : HOW CPU% UTILIZATION COLLECTED ON AIX VIA EM
    Bug 15925194 : AIX COMPUTING METRICS INCORRECTLY

9. Oracle on AIX - where's my cpu time ?

10. nmon CPU graphs - Why are the PCPU_ALL graphs lower?

11. dW:AIX and UNIX:Performance Tools Forum:Nmon - PCPU_ALL

12. libperfstat.h

13. POWER8