Wednesday, April 29, 2015

IBM AIX POWER7 CPU Usage and Throughput

This blog summarizes the author's observations and understanding of CPU usage on the AIX POWER7 SMT4 architecture.
POWER8 content is added in Section "8 Power8".

1. POWER7 Execution Units


 POWER7 Core is made of 12 execution units [1]:

   2 fixed-point units
   2 load/store units
   4 double-precision floating-point units
   1 vector unit supporting VSX
   1 decimal floating-point unit
   1 branch unit
   1 condition register unit

2. CPU Usage and Throughput


When SMT=4 is set, each core provides 4 Hardware Thread Contexts (HTC, logical CPU) and can simultaneously execute 4 software threads (processes, tasks).

For example, if more than 2 threads want to execute floating-point instructions in the same cycle, the 3rd and 4th threads have to wait for one of the two FP units to become free. Therefore, with SMT=4, the number of instructions executed by a single HTC drops, but overall throughput per core goes up. IBM claims a 60% throughput boost; that is, when 4 processes run on a core (smt=4), it delivers 1.6 times the throughput of a single process per core ([2], [3]). In the case of smt=2, the boost is 40%, or 1.4 times the throughput (note that we use lower-case "smt=4" to distinguish the number of busy hardware threads from the POWER SMT configuration SMT=4).

Mathematically, with smt=4, one could think that 25% core usage provides 40% of the CPU power. The response time increases from 1 to 2.5 (= 1/0.4), instead of to 4.

Now comes the puzzle: how much CPU usage should we show for each HTC and each process? 25% or 40% in the above example? In general, measuring and modelling SMT CPU usage is an ongoing research subject ([5]).

POWER7 introduces a new model of CPU usage. The general intent of POWER7 is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput [2].

With smt=4, i.e. 4 threads running per core, each thread gets 25% of a whole core and provides 40% of the smt=1 throughput. To build up the linear relation of throughput to CPU usage, the CPU usage for smt=1, 2, 3, 4 can be computed as:

  CPU%(smt=1) = (1.0/0.4) * 25% = 62.50%
  CPU%(smt=2) = (0.7/0.4) * 25% = 43.75%
  CPU%(smt=3) = (0.5/0.4) * 25% = 31.25%
  CPU%(smt=4) = (0.4/0.4) * 25% = 25.00%

Note that for smt=3, the boost of 50% (1.5 times throughput) stems from this blog's own tests and may be inaccurate.

Expressed as a linear equation:

  t = f(s) * u

where t is throughput, s is smt (with possible values 1, 2, 3, 4), and u is CPU usage.

Putting all together, we can draw a table:

smt  Throughput/core  Throughput/HTC   CPU%
  1              1.0             1.0  62.50
  2              1.4             0.7  43.75
  3              1.5             0.5  31.25
  4              1.6             0.4  25.00

Table-1
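Table-1 can be reproduced with a few lines of Python (a minimal sketch; the 0.5 per-HTC figure for smt=3 is this blog's own estimate):

```python
# Reproduce Table-1: the calibrated CPU% of one HTC is its per-HTC
# throughput scaled so that smt=4 (0.4 per HTC) maps to 25%, one
# HTC's share of a 4-thread core.
throughput_per_htc = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.4}

def cpu_pct(smt):
    """Calibrated CPU% of one HTC when `smt` threads are busy on the core."""
    return throughput_per_htc[smt] / 0.4 * 25.0

for smt, t in throughput_per_htc.items():
    print(f"smt={smt}  Throughput/HTC={t:.1f}  CPU%={cpu_pct(smt):.2f}")
```

The slope is the same for every smt level (Throughput/HTC divided by CPU% is always 1.6/100), which is exactly the linearity the model is built for.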

Therefore, the maximum CPU usage of an HTC (logical CPU) and of a software thread (process or task) is 62.5%. On POWER7 with SMT=4, it would be astonishing to observe a process CPU usage above 65%, or an HTC CPU usage above 65% (mpstat -s).

Picking the performance test data out of Blog [6] Table-1 (tested on POWER7, 4 cores, SMT=4, Oracle 11.2.0.3.0), and verifying it against the above linear relations:

JOB_CNT  HTC/Core  C2_RUN_CNT  Throughput/HTC  Throughput_Based_CPU%  Throughput_Ratio_to_Min  Theory_Throughput/HTC
      1         1         119     119 (119/1)                  64.67            2.59 (119/46)                 115.00
      8         2         580      73 (580/8)                  39.40            1.58 (73/46)                   80.50
     12         3         654     55 (654/12)                  29.89            1.20 (55/46)                   57.50
     16         4         730     46 (730/16)                  25.00            1.00 (46/46)                   46.00

Table-2

where Throughput_Based_CPU%:

   (119/46)*25% = 64.67%
   (73/46)*25%  = 39.40%
   (55/46)*25%  = 29.89%

and Theory_Throughput/HTC based on linear interpolation:

   46*(0.6250/0.25) = 115.00
   46*(0.4375/0.25) = 80.50
   46*(0.3125/0.25) = 57.50

Table-2 shows that Theory_Throughput is close to the tested throughput. Thus the designed CPU usage is a calibrated, scalable metric (note: the "S" in POWER8 server names signifies Scale-out).

Applications with high transaction volumes are usually benchmarked in terms of throughput, hence binding throughput linearly to CPU usage is a practical approach to assessing application performance.

In principle, CPU usage represents the throughput, and its complement (1 - usage) stands for the remaining available capacity. One process running in one core with a CPU usage of 62.5% on the first HTC means there is still 37.5% available capacity on the other 3 HTCs, each of which can take a share of 12.5%.

In practice, CPU utilization can be applied as a metric for charging back computing resources used, and its complement can be used for capacity-planning predictions.

This model of SMT CPU accounting is not widely acknowledged, and has therefore caused confusion. For example, Oracle Note 1963791.1:

 Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1) [8]

where session trace records:
 cpu time = 86.86   waited time = 7.06  elapsed time = 142.64

and the difference:
 142.64 - (86.86 + 7.06) = 48.72 seconds,
is interpreted as "Unaccounted Time".

In fact,
   86.86/142.64 = 60.90%,
indicates that a single Oracle session alone occupied almost one full core.
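The note's arithmetic can be replayed directly (figures from the trace record above):

```python
# Replay the arithmetic of Oracle Note 1963791.1 under the PURR model:
# 86.86s of CPU over 142.64s elapsed is ~60.9%, essentially the 62.5%
# ceiling of a single HTC at smt=1; the session had the core to itself.
cpu_time, wait_time, elapsed = 86.86, 7.06, 142.64

unaccounted = elapsed - (cpu_time + wait_time)
utilization = cpu_time / elapsed * 100
print(f"unaccounted = {unaccounted:.2f}s, CPU/elapsed = {utilization:.1f}%")
```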

Blog [9] also reported a similar observation on AIX POWER7 and tried to explain the unaccounted time.

People working on other UNIX systems (Solaris, HP-UX, Linux) are probably used to the intuitive interpretation of CPU time and elapsed time, but with the advance of multi-threaded processors like POWER, a little rethinking helps disperse the confusion, so that CPU resources can be efficiently allocated and accurately assessed.

3. POWER PURR


According to [2][4],

 POWER5 includes a per-thread processor utilization of resources register (PURR), which increments at the timebase frequency multiplied by the fraction of cycles on which the thread can dispatch instructions.

 Beginning with IBM® POWER5 TM processor architecture, a new register, PURR, is introduced to assist in computing the utilization.
 The PURR stands for Processor Utilization of Resources Register, and it is available per Hardware Thread Context.

 The PURR counts in proportion to the real time clock (timebase)
 The SPURR stands for Scaled Processor Utilization of Resources Register.
 The SPURR is similar to PURR except that it increments proportionally to the processor core frequency.
 The AIX® lparstat, sar & mpstat utilities are modified to report the PURR-SPURR ratio via a new column, named "nsp".

and it demonstrates the enhanced commands: time (timex), sar -P ALL, mpstat -s, lparstat -E

AIX provides the command pprof with the flag -r PURR to report CPU usage in PURR time instead of TimeBase.

For example, start one CPU intensive Oracle session (process) in one core for a duration of 120 seconds:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 1, p_dur_seconds => 120);
 (see TestCase in Blog [7], POWER7, 4 Core, SMT=4)

In this case a single core runs in smt=1. Track its PURR time for 100 seconds by:
 pprof 100 -r PURR

and displaying the report by:
 head -n 50 pprof.cpu

The output shows (irrelevant lines removed):

                    Pprof CPU Report
        E = Exec'd      F = Forked
        X = Exited      A = Alive (when traced started or stopped)
        C = Thread Created
        * = Purr based values
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 42598406  7864540  AA 21299317        0    62.930     0.037    99.805    99.768

If tracking with TimeBase by:
 pprof 100

The output (head -n 50 pprof.cpu) looks like:

               Pname      PID     PPID  BE      TID     PTID  ACC_time  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    =====  ========  ========  ========  ========
     ora_j000_testdb  1835064        0  AA  2687059        0    99.899     0.016    99.916    99.900

Continue our example by starting 8 CPU-intensive Oracle sessions (each core runs in smt=2):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 8, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 17760298  7864540  AA 57475195        0    42.910     0.340    99.210    98.870
    
Then start 12 CPU-intensive Oracle sessions (each core runs in smt=3):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 12, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:
   
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 33095898  7864540  AA 50135123        0    30.658     0.017   100.008    99.990

And finally start 16 CPU-intensive Oracle sessions (each core runs in smt=4):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 16, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j014_testdb 33488964  7864540  AA 73531621        0    24.673     0.143    99.145    99.002

We can see that ACC_time* correlates well with CPU% of Table-1. The small difference is probably due to a single point of contention on the Oracle latch: row cache objects - child: dc_users [7].
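Putting the four pprof runs next to Table-1 (ACC_time* values copied from the reports above; the 100-second traces make ACC_time* in seconds directly comparable to CPU%):

```python
# Compare pprof's PURR-based ACC_time* (100-second traces, so seconds
# are directly comparable to percent) with the model CPU% of Table-1.
acc_time = {1: 62.930, 2: 42.910, 3: 30.658, 4: 24.673}  # from the reports above
table1_cpu = {1: 62.50, 2: 43.75, 3: 31.25, 4: 25.00}

for smt in sorted(acc_time):
    diff = acc_time[smt] - table1_cpu[smt]
    print(f"smt={smt}  pprof={acc_time[smt]:6.2f}  model={table1_cpu[smt]:5.2f}  diff={diff:+.2f}")
```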

The study [4] shows that PURR limits its inaccuracy to 26% in the single-core POWER5 configuration (CMP+SMT).

Additionally, in AIX, the number of physical processors consumed is reported by sar.physc and vmstat.pc, and the percentage of entitled capacity consumed by sar.%entc and vmstat.ec.

By the way, Linux on POWER machines reads PURR at regular intervals and makes the values available through a file in procfs [4].


3.1 PURR APIs


AIX perfstat_cpu APIs include perfstat_cpu_total(), perfstat_partition_total(), perfstat_cpu_total_wpar(), and perfstat_cpu_util().
The perfstat_partition_total subroutine retrieves global Micro-Partitioning® usage statistics into:
  perfstat_partition_total_t lparstats;  
In libperfstat.h File[12], perfstat_partition_total_t contains the following PURR members:

  u_longlong_t purr_counter  Number of PURR cycles spent in user and kernel mode.
  u_longlong_t puser  Raw number of physical processor ticks in user mode.
  u_longlong_t psys   Raw number of physical processor ticks in system mode.
  u_longlong_t pidle  Raw number of physical processor ticks idle.
  u_longlong_t pwait  Raw number of physical processor ticks waiting for I/O.    
perfstat_cpu_total Subroutine fills perfstat_cpu_total_t, which contains the following members:

  u_longlong_t user  Raw total number of clock ticks spent in user mode.
  u_longlong_t sys   Raw total number of clock ticks spent in system mode.
  u_longlong_t idle  Raw total number of clock ticks spent idle.
  u_longlong_t wait  Raw total number of clock ticks spent waiting for I/O.
(see File: libperfstat.h)

The perfstat_partition_total interface demonstrates emulating the lparstat command, including PCPU:

  /* calculate physical processor ticks during the last interval in user, system, idle and wait mode  */
  delta_pcpu_user  = lparstats.puser - last_pcpu_user; 
  delta_pcpu_sys   = lparstats.psys  - last_pcpu_sys;
  delta_pcpu_idle  = lparstats.pidle - last_pcpu_idle;
  delta_pcpu_wait  = lparstats.pwait - last_pcpu_wait; 
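The delta computation can be illustrated off-AIX with a small sketch; the counter values below are hypothetical, standing in for two successive perfstat_partition_total() samples:

```python
# Sketch of the lparstat-style computation: each mode's utilization is
# its share of the PURR tick deltas between two samples. On AIX the
# counters come from perfstat_partition_total(); the values below are
# hypothetical stand-ins for two successive samples.
prev = {"puser": 1000, "psys": 200, "pidle": 2600, "pwait": 200}
curr = {"puser": 1900, "psys": 350, "pidle": 3500, "pwait": 250}

delta = {k: curr[k] - prev[k] for k in prev}   # ticks in the interval
total = sum(delta.values())
for mode, d in delta.items():
    print(f"{mode[1:]:>5}: {d / total * 100:5.1f}%")
```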
Oracle's link to the AIX perfstat_cpu_total is visible by:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep perfstat_cpu_total

Name                 Type  Value       Size
-------------------  ----  ----------  ----
.perfstat_cpu_total  T     4580969124
.perfstat_cpu_total  t     4580969124   44
perfstat_cpu_total   U              -
perfstat_cpu_total   d     4861786384    8

 Type:
   T Global text symbol
   t Local text symbol
   U Undefined symbol
   d Local data symbol
but it is not clear how Oracle calls perfstat_partition_total.

4. Thread Scheduling and Throughput


Picking the performance test data out of Blog [6] Table-3, and adding the smt per core, we get:

JOB_CNT  C1_RUN_CNT(Throughput)  MIN  MAX  Core_1_smt  Core_2_smt  Core_3_smt  Core_4_smt  Theory_Throughput
      1                     118  118  118           1           0           0           0             118.00
      2                     240  120  120           1           1           0           0             236.00
      3                     360  120  120           1           1           1           0             354.00
      4                     461  109  120           1           1           1           1             472.00
      5                     476   74  119           2           1           1           1             519.20
      6                     515   75   97           2           2           1           1             566.40
      7                     551   66  100           2           2           2           1             613.60
      8                     569   63   77           2           2           2           2             660.80
      9                     597   58   76           3           2           2           2             672.60
     10                     601   56   73           3           3           2           2             684.40
     11                     613   48   67           3           3           3           2             696.20
     12                     646   49   64           3           3           3           3             708.00
     13                     683   47   66           4           3           3           3             719.80
     14                     696   46   65           4           4           3           3             731.60
     15                     714   45   51           4           4           4           3             743.40
     16                     733   44   47           4           4           4           4             755.20

Table-3

where Theory_Throughput is calculated from Table-1 above and the smt per core, for example:

 JOB_CNT=5,  118*1.4*1+118*3     = 519.2
 JOB_CNT=11, 118*1.5*3+118*1.4*1 = 696.2
 JOB_CNT=15, 118*1.6*3+118*1.5*1 = 743.4
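The Theory_Throughput arithmetic can be expressed as a small helper (a sketch; boost factors taken from Table-1, with an inactive core treated as smt=0):

```python
# Theory_Throughput for Table-3: sum the per-core throughput, scaling
# the smt=1 baseline (118 runs/core) by Table-1's boost factors; an
# inactive core (smt=0) contributes nothing.
boost = {0: 0.0, 1: 1.0, 2: 1.4, 3: 1.5, 4: 1.6}
baseline = 118

def theory(core_smts):
    """core_smts: smt level of each of the 4 cores."""
    return sum(baseline * boost[s] for s in core_smts)

print(f"{theory([2, 1, 1, 1]):.1f}")   # JOB_CNT=5
print(f"{theory([3, 3, 3, 2]):.1f}")   # JOB_CNT=11
```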

Blog [3] reveals a particularity concerning the non-existence of an smt=3 mode: when starting 9 processes (Oracle sessions) on a POWER7 with 4 cores and SMT=4, there will be 1 core running with 4 HTCs, 1 core having only 1 HTC, and 2 cores with 2 HTCs; in total, 9 HTCs for 9 Oracle sessions.

We test this by running:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 9, p_dur_seconds => 120);
 
"pprof 100 -r PURR" shows:

            Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
            =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
  ora_j000_testdb 43646996  7864540  AA 65273961        0    43.578     0.008    97.823    97.815
  ora_j001_testdb 14090250  7864540  AA 45744377        0    43.314     1.621   100.010    98.389
  ora_j002_testdb 38338754  7864540  AA 28442745        0    40.696     0.007   100.000    99.993
  ora_j003_testdb 33095926  7864540  AA 78119153        0    36.575     0.010    99.922    99.912
  ora_j004_testdb 39583756  7864540  AA 35258545        0    36.204     2.242    97.824    95.582
  ora_j005_testdb 42401958  7864540  AA 73662611        0    36.020     1.131    99.731    98.600
  ora_j006_testdb 12976182  7864540  AA 68681805        0    35.944     2.212    99.912    97.700
  ora_j007_testdb 49086646  7864540  AA 75563151        0    32.372     1.893    97.823    95.930
  ora_j008_testdb  7602206  7864540  AA 32112699        0    31.676     1.893    97.823    95.930


"mpstat -ws" shows:

              Proc0                           Proc4            
              94.56%                          99.92%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  22.43%  24.13%  36.16%  11.84%  30.39%  20.72%  30.08%  18.72%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.72%  24.80%  16.14%  33.35%  48.67%  46.49%   2.35%   2.49%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
    0      47       6       0      47    0.20     5.1
    1      60       0       0      40    0.24     5.9
    2      92       0       0       8    0.38     9.4
    3      14       0       0      86    0.13     3.2
    4      99       0       0       1    0.32     8.1
    5      87       0       0      13    0.19     4.7
    6      89       0       0      11    0.30     7.6
    7      73       0       0      27    0.19     4.6
    8     100       0       0       0    0.26     6.4
    9     100       0       0       0    0.25     6.2
    10     43       0       0      57    0.15     3.8
    11    100       0       0       0    0.34     8.6
    12    100       0       0       0    0.48    12.0
    13    100       0       0       0    0.48    12.0
    14      0       0       0     100    0.02     0.5
    15     11       0       0      89    0.02     0.5
    U       -       -       0       2    0.05     1.3
    -      84       0       0      16    3.95    98.7


However, the above output shows that 2 HTCs (cpu14, cpu15) are almost idle, one (cpu3) has a low workload, and the other 13 HTCs are more or less busy; probably only the top 9 busy HTCs carry the 9 running Oracle sessions.

The two idle HTCs (cpu14, cpu15) in core 4 could also signify that 3 cores are running in smt=4 and one core in smt=2.

Applying this to 15 processes (Oracle sessions), there would be 3 cores running with 4 HTCs and one core having only 2 HTCs; in total, 14 HTCs for 15 Oracle sessions.

Let's test it by:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 15, p_dur_seconds => 120);

"pprof 100 -r PURR" shows:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 40108192  7864540  AA 15859877        0    27.389     0.020   100.003    99.983
     ora_j001_testdb 11927686  7864540  AA 40697939        0    27.277     0.023   100.020    99.997
     ora_j002_testdb 43647120  7864540  AA 30867661        0    26.892     0.657   100.003    99.346
     ora_j003_testdb 17760278  7864540  AA 73531567        0    26.695     0.009    99.040    99.031
     ora_j004_testdb  8388634  7864540  AA 60424235        0    26.576     0.003   100.013   100.011
     ora_j005_testdb 33095868  7864540  AA 30801969        0    25.437     0.657    98.956    98.300
     ora_j006_testdb 37290002  7864540  AA 26411087        0    25.278     0.656    97.135    96.478
     ora_j007_testdb 44105940  7864540  AA 39190655        0    24.977     0.003   100.009   100.006
     ora_j008_testdb 45285388  7864540  AA 72482871        0    24.773     0.735   100.013    99.279
     ora_j009_testdb 33488948  7864540  AA 75628593        0    24.262     0.015    96.875    96.860
     ora_j010_testdb 13828318  7864540  AA 67240145        0    24.256     0.016    96.876    96.860
     ora_j011_testdb 42926108  7864540  AA 64421997        0    24.233     0.024    96.874    96.850
     ora_j012_testdb 42336484  7864540  AA 32112833        0    24.180     0.016    96.876    96.860
     ora_j013_testdb 38600812  7864540  AA 47972531        0    24.049     0.734    96.874    96.140
     ora_j014_testdb 46727262  7864540  AA 25690295        0    24.047     0.734    96.874    96.140

"mpstat -s" shows:

              Proc0                           Proc4            
              99.99%                         100.00%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  21.33%  28.15%  24.69%  25.81%  25.00%  25.03%  24.98%  24.99%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.03%  25.02%  24.66%  25.29%  25.92%  24.12%  24.95%  25.02%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
   0       72       5       0      23    0.21     5.3
   1       94       0       0       6    0.28     7.1
   2       91       0       0       9    0.24     6.1
   3       95       0       0       5    0.26     6.5
   4       99       0       0       1    0.25     6.2
   5      100       0       0       0    0.25     6.3
   6      100       0       0       0    0.25     6.3
   7      100       0       0       0    0.25     6.2
   8      100       0       0       0    0.25     6.3
   9      100       0       0       0    0.25     6.3
   10      99       0       0       1    0.25     6.2
   11     100       0       0       0    0.25     6.3
   12     100       0       0       0    0.25     6.4
   13      98       0       0       2    0.25     6.1
   14     100       0       0       0    0.25     6.2
   15     100       0       0       0    0.25     6.3
   -       97       0       0       3    4.00   100.0


The above outputs deliver no evidence for the non-existence of an smt=3 mode. It is possible that I missed some points here; it will be interesting to see how to demonstrate it.

5. AIX Internal Code


AIX "struct procinfo", used by the getprocs subroutine (/usr/include/procinfo.h), contains a comment on pi_cpu:

  struct  procinfo
  {
    /* scheduler information */
    unsigned long   pi_pri;         /* priority, 0 high, 31 low */
    unsigned long   pi_nice;        /* nice for priority, 0 to 39 */
    unsigned long   pi_cpu;         /* processor usage, 0 to 80 */
    ...
  };
Probably, in the minds of the AIX developers, a process's processor usage is not allowed to exceed 80.

All fields in Oracle procsinfo (not AIX struct "procinfo") can be listed by:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep -i procsinfo

  procsinfo:T153=s1568
  pi_pid:121,0,32;
  pi_ppid:121,32,32;
  ...
  pi_cpu:123,448,32;
  ...
  pi_utime:123,864,32;
  ...
  pi_stime:123,928,32;
  ...

6. vpm_throughput_mode


This AIX scheduler tunable parameter specifies the desired level of SMT exploitation for scaled throughput mode.

A value of 0 gives default behavior (raw throughput mode).
A value of 1, 2, or 4 selects scaled throughput mode and the desired level of SMT exploitation: the number of threads used on one core before the next core is used.

schedo -p -o vpm_throughput_mode=<value>
  0 Legacy Raw mode (default)
  1 Enhanced Raw mode with a higher threshold than legacy
  2 Scaled mode, use primary and secondary SMT threads
  4 Scaled mode, use all four SMT threads


Raw Mode (0, 1)

provides the highest per-thread throughput and best response times, at the expense of activating more physical cores. For example, Legacy Raw mode (the default) dispatches workload to all primary threads before using any secondary threads.

Secondary threads are activated when the load of all primary threads is over a certain utilization (probably 50%) and new workload (a process) arrives to be dispatched.

The 3rd and 4th threads are activated when the load of the secondary threads is over a certain utilization (probably 20%) and new workload arrives to be dispatched.

Scaled Mode (2, 4)

aims at the highest per-core throughput (in the specified mode: 2 or 4), at the expense of per-thread response times and throughput. For example, Scaled mode 2 dispatches workload to both the primary and secondary threads of one core before using those of the next core; Scaled mode 4 dispatches workload to all 4 threads of one core before using those of the next core.

In Scaled mode 2, the 1st and 2nd threads of each core are bound together, so both carry a similar workload (CPU usage). The 3rd and 4th threads are activated when the load of the 1st and 2nd threads is over a certain utilization (probably 30%) and new workload arrives to be dispatched.

Note that this tuning intention is per active core, not across all cores in the LPAR. In fact, it aims at activating fewer cores. It seems a setting conceived for a test system with a few LPARs.

Referring to Table-1, vpm_throughput_mode=2 corresponds to smt=2: two threads running per core, Throughput/HTC = 0.7, CPU% = 43.75. In real applications with Scaled mode 2, we also observed that CPU% is constrained under 43% even when the run queue is shorter than the number of cores. That means that even though the workload is low, CPU% cannot climb to its maximum of 62.50, and applications cannot benefit from the maximum Throughput/HTC. For performance-critical applications, Scaled mode is questionable. By contrast, Raw mode automatically tunes the CPU% based on the workload, which is probably why vpm_throughput_mode defaults to 0.

We can see there is no vpm_throughput_mode=3. This is probably related to the particularity mentioned in Blog [3] about the non-existence of an smt=3 mode.

There is also a naming confusion. By default, POWER7 runs in "Legacy Raw mode", while POWER6 behaves like the "scaled throughput mode". Normally "Legacy" means something used in a previous model or release, but here POWER6 uses something like "Scaled mode", and a later model (POWER7) introduces a "Legacy" mode 0.

7. NMON Report


The NMON report contains three kinds of worksheets on CPU usage: PCPU_ALL (PCPUnnn), SCPU_ALL (SCPUnnn), CPU_ALL (CPUnnn).

AIXpert Blog[10] said:

   If I had to guess then the Utilisation numbers in our PCPU_ALL graph (above) have been scaled from 75 cores to roughly 62 cores so "show" some SMT threads are unused so the CPU cores are not fully used (and given enough threads it could give you more performance). Roughly 10 - 15% more. Now, in my humble opinion, this is totally the wrong way of doing this as it is just plain confusing.

   The PCPU and SCPU stats where (in my humble opinion) a confusing mistake and only useful if you have the CPUs in Power saving mode i.e. its changing the CPU GHz to save electrical power.

and IBM developerWorks Forum[11] described:

   PCPU_ALL is the actual physical resource consumption. It would be in units of cores.
   SCPU_ALL is the scaled physical resource consumption. Differs from PCPU_ALL if running at non-nominal frequency.
          Again in units of cores. SCPU, PCPU do not differ when the system runs in the nominal frequency.
   CPU_ALL: PhysicalCPU tag (0.376) denotes the fraction of core used by this partition.
          The distribution of the 0.376 across various modes (user, sys, wait, idle) is proportional to the CPU_ALL% in all modes. 
         Applying this % would give the PCPU_ALL.

In short, PCPU_ALL represents PURR and SCPU_ALL represents SPURR; CPU_ALL denotes the PCPU_ALL modes (user, sys, wait, idle) as percentages, whose sum should be around 100%.

PCPUnnn represents the CPU% of one single HTC (logical CPU, see Table-1); PCPU_ALL is the sum of all PCPUnnn across the various modes (user, sys, wait, idle).

In case of smt=2 (only two HTCs per core are activated), at each time instance, sum of user and sys in PCPUnnn should be under 43.75%; for each core, sum of user and sys should be under 87.5% (2*43.75); whereas for whole LPAR, sum of user and sys in PCPU_ALL should be under number_of_core * 87.50%.

In the case of smt=4 (all 4 HTCs per core activated), at each instant the sum of user and sys in PCPUnnn should be under 25.00%; for each core, the sum of user and sys should be under 100.00% (4*25.00); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 100.00%.
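These per-HTC and per-core caps follow directly from Table-1:

```python
# Bounds implied by Table-1 for PURR-based utilization (user + sys):
# per-HTC CPU% times the number of active HTCs gives the per-core cap.
per_htc_cpu = {2: 43.75, 4: 25.00}   # smt=2 and smt=4 from Table-1

core_cap = {}
for smt, htc_cap in per_htc_cpu.items():
    core_cap[smt] = htc_cap * smt
    print(f"smt={smt}: per-HTC cap {htc_cap}%, per-core cap {core_cap[smt]}%")
```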

In the TOP worksheet, %CPU, %Usr, %Sys also represent PURR. Note that if Threads > 1, they are the sum over all threads aggregated by PID, and hence can exceed 80%.

In Oracle AWR report, %CPU is in PURR too.

8. Power8


(a). SMT=4: for a single HTC (smt=1), CPU% = 60.00% instead of the 62.50% of POWER7 (see Table-1).
       The throughput ratio of smt=1 vs. smt=4 is 60.00/25.00 = 2.4 instead of 2.5 on POWER7,
       that is, about 4% (2.5/62.5 = 0.1/2.5 = 0.04) less than POWER7.

(b). SMT=8: for a single HTC (smt=1), CPU% = 56.00%.
       The throughput ratio of smt=1 vs. smt=8 is 56.00/12.50 = 4.48.

The above preliminary figures need to be further verified.
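The ratios in (a) and (b) can be recomputed side by side with POWER7 (the POWER8 CPU% figures are this blog's preliminary estimates):

```python
# Recompute the smt=1 vs. full-SMT throughput ratios; the POWER8 CPU%
# figures are the blog's preliminary estimates, POWER7's from Table-1.
cases = {
    "POWER7 SMT=4": (62.50, 25.00),
    "POWER8 SMT=4": (60.00, 25.00),
    "POWER8 SMT=8": (56.00, 12.50),
}
ratios = {}
for name, (cpu_smt1, cpu_full) in cases.items():
    ratios[name] = cpu_smt1 / cpu_full
    print(f"{name}: {cpu_smt1}/{cpu_full} = {ratios[name]:.2f}")
```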

Each POWER8 core has 16 execution pipelines [13]:
 2  fixed-point pipelines
 2  load-store pipelines
 2* load pipelines (no results to store)
 4* double-precision floating-point pipelines, which can also act as eight single-precision pipelines
 2* fully symmetric vector pipelines with support for VMX and VSX AltiVec instructions.
 1  decimal floating-point pipeline
 1* cryptographic pipeline (AES, Galois Counter Mode, SHA-2)
 1  branch execution pipeline
 1  condition register logical pipeline
 
  Note: All units different from POWER7 are marked by "*". 
        POWER7 Core has 12 execution units, POWER8 16.

9. Conclusion


This blog presented the POWER7 model of CPU usage and throughput, and examined it with real cases. Accurate modelling leads not only to fruitful system tuning and trustworthy performance assessment, but also to fair charge-back and economical resource utilization (e.g. Power Saver Mode). In a coming study, we will investigate the applicability, and eventually the adaptation, of the model on the new POWER8 (SMT=8) and the future POWER9.

References


1. POWER7

2. Understanding CPU Utilization on AIX

3. Local, Near & Far Memory part 3 - Scheduling processes to SMT & Virtual Processors

4. P. Mackerras, T. S. Mathews, and R. C. Swanberg. Operating System Exploitation of the POWER5 System.
    IBM J. Res. Dev., 49(4/5):533–539, July 2005.

5. CPU accounting in multi-threaded processors

6. java stored procedure calls and latch: row cache objects, and performance

7. java stored procedure calls and latch: row cache objects

8. Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1)

    Bug 13354348 : UNACCOUNTED GAP BETWEEN ELAPSED TO CPU TIME ON 11.2 IN AIX
    Bug 16044824 - UNACCOUNTED GAP BETWEEN ELAPSED AND CPU TIME FOR DB 11.2 ON PLATFORM AIX POWER7
    Bug 18599013 : NEED TO CALCULATE THE UNACCOUNTED TIME FOR A TRACE FILE
    Bug 7410881 : HOW CPU% UTILIZATION COLLECTED ON AIX VIA EM
    Bug 15925194 : AIX COMPUTING METRICS INCORRECTLY

9. Oracle on AIX - where's my cpu time ?

10. nmon CPU graphs - Why are the PCPU_ALL graphs lower?

11. dW:AIX and UNIX:Performance Tools Forum:Nmon - PCPU_ALL

12. libperfstat.h

13. POWER8

Monday, April 20, 2015

Oracle 11.2.0.4.0 AWR "Tablespace IO Stats" Column Names Shifted

Oracle 11.2.0.4.0 added two new columns in the sections "Tablespace IO Stats" and "File IO Stats":

 1-bk Rds/s
 Av 1-bk Rd(ms)


but in "Tablespace IO Stats", both column names do not match the content of the table.

Running the appended TestCase, we get the AWR report for "Tablespace IO Stats":


Tablespace  Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  1-bk Rds/s  Av 1-bk Rd(ms)  Writes  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS       25         6          0         7.6          87            1.54       0            22             0              0


and "File IO Stats":

Tablespace  Filename      Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  1-bk Rds/s  Av 1-bk Rd(ms)  Writes  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS    test_tbs.dbf     25         6          0         7.6           2               0      87            22             0              0


In "Tablespace IO Stats", the columns "1-bk Rds/s" and "Av 1-bk Rd(ms)" must be swapped with the column "Writes" so that the names match the content:

Tablespace  Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  Writes  1-bk Rds/s  Av 1-bk Rd(ms)  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS       25         6          0         7.6      87        1.54               0            22             0              0


Another difference noticed is that both "Tablespace IO Stats" and "File IO Stats" report Read statistics:
     Av Rd(ms)
  Av Blks/Rd

but there are no symmetrical figures on Write like:
   *Av Wr(ms)
 *Av Blks/Wr


With the following query, we can supply these statistics:

select filename, file#, snap_id
      ,round(phyrds_d)                             "Reads"
      ,round(phyrds_d/interval_seconds)            "Av Reads/s"      
      ,round(readtim_d*10/nullif(phyrds_d, 0))     "Av Rd(ms)"       
      ,round(phyblkrd_d/nullif(phyrds_d, 0))       "Av Blks/Rd"  
      ,round(singleblkrds_d/interval_seconds)      "1-bk Rds/s"  
      ,round(singleblkrdtim_d*10/nullif(singleblkrds_d, 0))     "Av 1-bk Rd(ms)"
      ,round(phywrts_d)                            "Writes"  
      ,round(phywrts_d/interval_seconds)           "Av Writes/s"  
      ,round(writetim_d*10/nullif(phywrts_d, 0))   "*Av Wr(ms)"      -- * Not in AWR
      ,round(phyblkwrt_d/nullif(phywrts_d, 0))     "*Av Blks/Wr"     -- * Not in AWR
      ,round(wait_count_d)                         "Buffer Waits"  
      ,round(time_d*10/nullif(wait_count_d, 0))    "Av Buf Wt(ms)"   -- in CentiSeconds
from (
  select 
     phyrds - lag(phyrds) over(partition by file# order by snap_id) phyrds_d
    ,phywrts - lag(phywrts) over(partition by file# order by snap_id) phywrts_d
    ,singleblkrds - lag(singleblkrds) over(partition by file# order by snap_id) singleblkrds_d
    ,readtim - lag(readtim) over(partition by file# order by snap_id) readtim_d
    ,writetim - lag(writetim) over(partition by file# order by snap_id) writetim_d
    ,singleblkrdtim - lag(singleblkrdtim) over(partition by file# order by snap_id) singleblkrdtim_d
    ,phyblkrd - lag(phyblkrd) over(partition by file# order by snap_id) phyblkrd_d
    ,phyblkwrt - lag(phyblkwrt) over(partition by file# order by snap_id) phyblkwrt_d
    ,wait_count - lag(wait_count) over(partition by file# order by snap_id) wait_count_d   
    ,time - lag(time) over(partition by file# order by snap_id) time_d   
    ,interval_seconds
    ,t.*
from dba_hist_filestatxs t
   ,(select snap_id s_snap_id
           ,((sysdate + (end_interval_time - begin_interval_time)) - sysdate)*86400 interval_seconds 
       from dba_hist_snapshot)
where t.snap_id = s_snap_id
  and tsname = 'TEST_TBS'
  and snap_id >= (select max(snap_id) from dba_hist_snapshot) - 2
);


TestCode


Run code block:

 drop tablespace test_tbs including contents;
 
 create tablespace test_tbs datafile 'test_tbs.dbf' size 100m reuse online;
 
 drop table testt;
 
 create table testt(x number, y varchar2(1000)) tablespace test_tbs;
 
 exec sys.dbms_workload_repository.create_snapshot('ALL'); 
 
 insert into testt select level x, rpad('abc', 1000, 'x') y from dual connect by level <= 1000;
 
 commit;
 
 alter system flush buffer_cache; 
 
 select count(*) from testt;
 
 exec dbms_lock.sleep(3);
 
 exec sys.dbms_workload_repository.create_snapshot('ALL'); 
 
 select bytes, blocks from dba_segments where segment_name = 'TESTT';
And get the AWR report by:

select * from table(SYS.DBMS_WORKLOAD_REPOSITORY.awr_report_html(
  (select dbid from v$database), 1, 
  (select max(snap_id) from dba_hist_snapshot) - 1, 
  (select max(snap_id) from dba_hist_snapshot)));
There could be other misleading information in an AWR report. Recently we were puzzled by high Session UGA and PGA memory reported in AWR, where UGA is much larger than PGA in a dedicated server:
  session pga memory max    308,253,725,520    302,308,652,008
  session uga memory max  5,614,577,384,552  5,666,238,828,552
At the end, we found one MOS Note:
    High Session UGA & PGA Memory Reported in AWR (Doc ID 1483177.1)
which said:
    These statistics will be removed from AWR report in future versions, could be 12.1 
    so you should not depend on these fake numbers for investigations of performance issues.