Wednesday, April 29, 2015

IBM AIX POWER7 CPU Usage and Throughput

This blog summarizes the author's observations and understanding of CPU usage on the AIX POWER7 SMT4 architecture.
POWER8 content is added in Section "8 Power8".

1. POWER7 Execution Units


 POWER7 Core is made of 12 execution units [1]:

   2 fixed-point units
   2 load/store units
   4 double-precision floating-point units
   1 vector unit supporting VSX
   1 decimal floating-point unit
   1 branch unit
   1 condition register unit

2. CPU Usage and Throughput


When SMT=4 is set, each core provides 4 Hardware Thread Contexts (HTC, logical CPU) and can simultaneously execute 4 software threads (processes, tasks).

For example, if more than 2 threads want to execute floating-point instructions in the same cycle, the 3rd and 4th threads have to wait for one of the two FP units to become free. Therefore, with SMT=4, the number of instructions executed by a single HTC drops, but overall throughput per core goes up. IBM claims a 60% throughput boost; that is, when 4 processes run on a core (smt=4), it delivers 1.6 times the throughput of a single process per core ([2], [3]). In the case of smt=2, the boost is 40%, or 1.4 times the throughput (note that we use lower-case "smt=4" to distinguish the number of busy hardware threads from the POWER SMT configuration SMT=4).

Mathematically, with smt=4, one could think that 25% core usage provides 40% of the CPU power. The response time increases from 1 to 2.5 (= 1/0.4), instead of to 4.

Now comes the puzzle: how much CPU usage should we show for each HTC and each process? 25% or 40% in the above example? In general, measuring and modelling SMT CPU usage is an ongoing research subject ([5]).

POWER7 introduces a new model of CPU usage. The general intent of POWER7 is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput [2].

With smt=4, i.e. 4 threads running per core, each thread gets 25% of a whole core and provides 40% of the smt=1 throughput. To build up the linear relation of throughput to CPU usage, the CPU usage for smt=1, 2, 3, 4 can be computed as:

  CPU%(smt=1) = (1.0/0.4) * 25% = 62.50%
  CPU%(smt=2) = (0.7/0.4) * 25% = 43.75%
  CPU%(smt=3) = (0.5/0.4) * 25% = 31.25%
  CPU%(smt=4) = (0.4/0.4) * 25% = 25.00%

Note that for smt=3, the boost of 50% (1.5 times throughput) stems from this blog's own tests and may be inaccurate.

Expressed as a linear equation:

  t = f(s) * u

where t is throughput, s is smt (with possible values 1, 2, 3, 4), and u is CPU usage.

Putting all together, we can draw a table:

smt  Throughput/core  Throughput/HTC   CPU%
  1              1.0             1.0  62.50
  2              1.4             0.7  43.75
  3              1.5             0.5  31.25
  4              1.6             0.4  25.00

Table-1
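Table-1 can be reproduced with a few lines of Python (a minimal sketch; the 0.5 per-HTC figure for smt=3 is this blog's own estimate):

```python
# Reproduce Table-1: the calibrated CPU% of one HTC is its per-HTC
# throughput scaled so that smt=4 (0.4 per HTC) maps to 25%, one
# HTC's share of a 4-thread core.
throughput_per_htc = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.4}

def cpu_pct(smt):
    """Calibrated CPU% of one HTC when `smt` threads are busy on the core."""
    return throughput_per_htc[smt] / 0.4 * 25.0

for smt, t in throughput_per_htc.items():
    print(f"smt={smt}  Throughput/HTC={t:.1f}  CPU%={cpu_pct(smt):.2f}")
```

The slope is the same for every smt level (Throughput/HTC divided by CPU% is always 1.6/100), which is exactly the linearity the model is built for.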

Therefore, the maximum CPU usage of an HTC (logical CPU) and of a software thread (process or task) is 62.5%. On POWER7 with SMT=4, it would be astonishing to observe a process CPU usage above 65%, or an HTC CPU usage above 65% (mpstat -s).

Picking the performance test data out of Blog [6] Table-1 (tested on POWER7, 4 cores, SMT=4, Oracle 11.2.0.3.0), and verifying it against the above linear relations:

JOB_CNT  HTC/Core  C2_RUN_CNT  Throughput/HTC  Throughput_Based_CPU%  Throughput_Ratio_to_Min  Theory_Throughput/HTC
      1         1         119     119 (119/1)                  64.67            2.59 (119/46)                 115.00
      8         2         580      73 (580/8)                  39.40            1.58 (73/46)                   80.50
     12         3         654     55 (654/12)                  29.89            1.20 (55/46)                   57.50
     16         4         730     46 (730/16)                  25.00            1.00 (46/46)                   46.00

Table-2

where Throughput_Based_CPU%:

   (119/46)*25% = 64.67%
   (73/46)*25%  = 39.40%
   (55/46)*25%  = 29.89%

and Theory_Throughput/HTC based on linear interpolation:

   46*(0.6250/0.25) = 115.00
   46*(0.4375/0.25) = 80.50
   46*(0.3125/0.25) = 57.50

Table-2 shows that Theory_Throughput is close to the tested throughput. Thus the designed CPU usage is a calibrated, scalable metric (note: the "S" in POWER8 server names signifies Scale-out).

Applications with high transaction volumes are usually benchmarked in terms of throughput, hence binding throughput linearly to CPU usage is a practical approach to assessing application performance.

In principle, CPU usage represents the throughput, and its complement (1 - usage) stands for the remaining available capacity. One process running in one core with a CPU usage of 62.5% on the first HTC means there is still 37.5% available capacity on the other 3 HTCs, each of which can take a share of 12.5%.

In practice, CPU utilization can be applied as a metric for charging back computing resources used, and its complement can be used for capacity-planning predictions.

This model of SMT CPU accounting is not widely acknowledged, and has therefore caused confusion. For example, Oracle Note 1963791.1:

 Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1) [8]

where session trace records:
 cpu time = 86.86   waited time = 7.06  elapsed time = 142.64

and the difference:
 142.64 - (86.86 + 7.06) = 48.72 seconds,
is interpreted as "Unaccounted Time".

In fact,
   86.86/142.64 = 60.90%,
indicates that a single Oracle session alone occupied almost one full core.
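The note's arithmetic can be replayed directly (figures from the trace record above):

```python
# Replay the arithmetic of Oracle Note 1963791.1 under the PURR model:
# 86.86s of CPU over 142.64s elapsed is ~60.9%, essentially the 62.5%
# ceiling of a single HTC at smt=1; the session had the core to itself.
cpu_time, wait_time, elapsed = 86.86, 7.06, 142.64

unaccounted = elapsed - (cpu_time + wait_time)
utilization = cpu_time / elapsed * 100
print(f"unaccounted = {unaccounted:.2f}s, CPU/elapsed = {utilization:.1f}%")
```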

Blog [9] also reported a similar observation on AIX POWER7 and tried to explain the unaccounted time.

People working on other UNIX systems (Solaris, HP-UX, Linux) are probably used to the intuitive interpretation of CPU time and elapsed time, but with the advance of multi-threaded processors like POWER, a little rethinking helps disperse the confusion, so that CPU resources can be efficiently allocated and accurately assessed.

3. POWER PURR


According to [2][4],

 POWER5 includes a per-thread processor utilization of resources register (PURR), which increments at the timebase frequency multiplied by the fraction of cycles on which the thread can dispatch instructions.

 Beginning with IBM® POWER5 TM processor architecture, a new register, PURR, is introduced to assist in computing the utilization.
 The PURR stands for Processor Utilization of Resources Register, and it is available per Hardware Thread Context.

 The PURR counts in proportion to the real time clock (timebase)
 The SPURR stands for Scaled Processor Utilization of Resources Register.
 The SPURR is similar to PURR except that it increments proportionally to the processor core frequency.
 The AIX® lparstat, sar & mpstat utilities are modified to report the PURR-SPURR ratio via a new column, named "nsp".

and it demonstrates the enhanced commands: time (timex), sar -P ALL, mpstat -s, lparstat -E

AIX provides the command pprof with the flag -r PURR to report CPU usage in PURR time instead of TimeBase.

For example, start one CPU intensive Oracle session (process) in one core for a duration of 120 seconds:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 1, p_dur_seconds => 120);
 (see TestCase in Blog [7], POWER7, 4 Core, SMT=4)

In this case a single core runs in smt=1. Track its PURR time for 100 seconds by:
 pprof 100 -r PURR

and displaying the report by:
 head -n 50 pprof.cpu

The output shows (irrelevant lines removed):

                    Pprof CPU Report
        E = Exec'd      F = Forked
        X = Exited      A = Alive (when traced started or stopped)
        C = Thread Created
        * = Purr based values
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 42598406  7864540  AA 21299317        0    62.930     0.037    99.805    99.768

If tracking with TimeBase by:
 pprof 100

The output (head -n 50 pprof.cpu) looks like:

               Pname      PID     PPID  BE      TID     PTID  ACC_time  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    =====  ========  ========  ========  ========
     ora_j000_testdb  1835064        0  AA  2687059        0    99.899     0.016    99.916    99.900

Continue our example by starting 8 CPU-intensive Oracle sessions (each core runs in smt=2):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 8, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 17760298  7864540  AA 57475195        0    42.910     0.340    99.210    98.870
    
Then start 12 CPU-intensive Oracle sessions (each core runs in smt=3):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 12, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:
   
               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j007_testdb 33095898  7864540  AA 50135123        0    30.658     0.017   100.008    99.990

And finally start 16 CPU-intensive Oracle sessions (each core runs in smt=4):

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 16, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j014_testdb 33488964  7864540  AA 73531621        0    24.673     0.143    99.145    99.002

We can see that ACC_time* correlates well with CPU% of Table-1. The small difference is probably due to a single point of contention on the Oracle latch: row cache objects - child: dc_users [7].
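Putting the four pprof runs next to Table-1 (ACC_time* values copied from the reports above; the 100-second traces make ACC_time* in seconds directly comparable to CPU%):

```python
# Compare pprof's PURR-based ACC_time* (100-second traces, so seconds
# are directly comparable to percent) with the model CPU% of Table-1.
acc_time = {1: 62.930, 2: 42.910, 3: 30.658, 4: 24.673}  # from the reports above
table1_cpu = {1: 62.50, 2: 43.75, 3: 31.25, 4: 25.00}

for smt in sorted(acc_time):
    diff = acc_time[smt] - table1_cpu[smt]
    print(f"smt={smt}  pprof={acc_time[smt]:6.2f}  model={table1_cpu[smt]:5.2f}  diff={diff:+.2f}")
```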

The study [4] shows that PURR limits its inaccuracy to 26% in the single-core POWER5 configuration (CMP+SMT).

Additionally, in AIX, the number of physical processors consumed is reported by sar.physc and vmstat.pc, and the percentage of entitled capacity consumed by sar.%entc and vmstat.ec.

By the way, Linux on POWER machines reads PURR at regular intervals and makes the values available through a file in procfs [4].


3.1 PURR APIs


AIX perfstat_cpu APIs include perfstat_cpu_total(), perfstat_partition_total(), perfstat_cpu_total_wpar(), and perfstat_cpu_util().
The perfstat_partition_total subroutine retrieves global Micro-Partitioning® usage statistics into:
  perfstat_partition_total_t lparstats;  
In libperfstat.h File[12], perfstat_partition_total_t contains the following PURR members:

  u_longlong_t purr_counter  Number of PURR cycles spent in user and kernel mode.
  u_longlong_t puser  Raw number of physical processor ticks in user mode.
  u_longlong_t psys   Raw number of physical processor ticks in system mode.
  u_longlong_t pidle  Raw number of physical processor ticks idle.
  u_longlong_t pwait  Raw number of physical processor ticks waiting for I/O.    
perfstat_cpu_total Subroutine fills perfstat_cpu_total_t, which contains the following members:

  u_longlong_t user  Raw total number of clock ticks spent in user mode.
  u_longlong_t sys   Raw total number of clock ticks spent in system mode.
  u_longlong_t idle  Raw total number of clock ticks spent idle.
  u_longlong_t wait  Raw total number of clock ticks spent waiting for I/O.
(see File: libperfstat.h)

The perfstat_partition_total interface demonstrates emulating the lparstat command, including PCPU:

  /* calculate physical processor ticks during the last interval in user, system, idle and wait mode  */
  delta_pcpu_user  = lparstats.puser - last_pcpu_user; 
  delta_pcpu_sys   = lparstats.psys  - last_pcpu_sys;
  delta_pcpu_idle  = lparstats.pidle - last_pcpu_idle;
  delta_pcpu_wait  = lparstats.pwait - last_pcpu_wait; 
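The delta computation can be illustrated off-AIX with a small sketch; the counter values below are hypothetical, standing in for two successive perfstat_partition_total() samples:

```python
# Sketch of the lparstat-style computation: each mode's utilization is
# its share of the PURR tick deltas between two samples. On AIX the
# counters come from perfstat_partition_total(); the values below are
# hypothetical stand-ins for two successive samples.
prev = {"puser": 1000, "psys": 200, "pidle": 2600, "pwait": 200}
curr = {"puser": 1900, "psys": 350, "pidle": 3500, "pwait": 250}

delta = {k: curr[k] - prev[k] for k in prev}   # ticks in the interval
total = sum(delta.values())
for mode, d in delta.items():
    print(f"{mode[1:]:>5}: {d / total * 100:5.1f}%")
```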
Oracle's link to the AIX perfstat_cpu_total is visible by:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep perfstat_cpu_total

Name                 Type  Value       Size
-------------------  ----  ----------  ----
.perfstat_cpu_total  T     4580969124
.perfstat_cpu_total  t     4580969124   44
perfstat_cpu_total   U              -
perfstat_cpu_total   d     4861786384    8

 Type:
   T Global text symbol
   t Local text symbol
   U Undefined symbol
   d Local data symbol
but it is not clear how Oracle calls perfstat_partition_total.

4. Thread Scheduling and Throughput


Picking the performance test data out of Blog [6] Table-3, and adding the smt per core, we get:

JOB_CNT  C1_RUN_CNT(Throughput)  MIN  MAX  Core_1_smt  Core_2_smt  Core_3_smt  Core_4_smt  Theory_Throughput
      1                     118  118  118           1           0           0           0             118.00
      2                     240  120  120           1           1           0           0             236.00
      3                     360  120  120           1           1           1           0             354.00
      4                     461  109  120           1           1           1           1             472.00
      5                     476   74  119           2           1           1           1             519.20
      6                     515   75   97           2           2           1           1             566.40
      7                     551   66  100           2           2           2           1             613.60
      8                     569   63   77           2           2           2           2             660.80
      9                     597   58   76           3           2           2           2             672.60
     10                     601   56   73           3           3           2           2             684.40
     11                     613   48   67           3           3           3           2             696.20
     12                     646   49   64           3           3           3           3             708.00
     13                     683   47   66           4           3           3           3             719.80
     14                     696   46   65           4           4           3           3             731.60
     15                     714   45   51           4           4           4           3             743.40
     16                     733   44   47           4           4           4           4             755.20

Table-3

where Theory_Throughput is calculated from Table-1 above and the smt per core, for example:

 JOB_CNT=5,  118*1.4*1+118*3     = 519.2
 JOB_CNT=11, 118*1.5*3+118*1.4*1 = 696.2
 JOB_CNT=15, 118*1.6*3+118*1.5*1 = 743.4
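The Theory_Throughput arithmetic can be expressed as a small helper (a sketch; boost factors taken from Table-1, with an inactive core treated as smt=0):

```python
# Theory_Throughput for Table-3: sum the per-core throughput, scaling
# the smt=1 baseline (118 runs/core) by Table-1's boost factors; an
# inactive core (smt=0) contributes nothing.
boost = {0: 0.0, 1: 1.0, 2: 1.4, 3: 1.5, 4: 1.6}
baseline = 118

def theory(core_smts):
    """core_smts: smt level of each of the 4 cores."""
    return sum(baseline * boost[s] for s in core_smts)

print(f"{theory([2, 1, 1, 1]):.1f}")   # JOB_CNT=5
print(f"{theory([3, 3, 3, 2]):.1f}")   # JOB_CNT=11
```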

Blog [3] reveals a particularity concerning the non-existence of an smt=3 mode: when starting 9 processes (Oracle sessions) on a POWER7 with 4 cores and SMT=4, there will be 1 core running with 4 HTCs, 1 core having only 1 HTC, and 2 cores with 2 HTCs; in total, 9 HTCs for 9 Oracle sessions.

We test this by running:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 9, p_dur_seconds => 120);
 
"pprof 100 -r PURR" shows:

            Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
            =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
  ora_j000_testdb 43646996  7864540  AA 65273961        0    43.578     0.008    97.823    97.815
  ora_j001_testdb 14090250  7864540  AA 45744377        0    43.314     1.621   100.010    98.389
  ora_j002_testdb 38338754  7864540  AA 28442745        0    40.696     0.007   100.000    99.993
  ora_j003_testdb 33095926  7864540  AA 78119153        0    36.575     0.010    99.922    99.912
  ora_j004_testdb 39583756  7864540  AA 35258545        0    36.204     2.242    97.824    95.582
  ora_j005_testdb 42401958  7864540  AA 73662611        0    36.020     1.131    99.731    98.600
  ora_j006_testdb 12976182  7864540  AA 68681805        0    35.944     2.212    99.912    97.700
  ora_j007_testdb 49086646  7864540  AA 75563151        0    32.372     1.893    97.823    95.930
  ora_j008_testdb  7602206  7864540  AA 32112699        0    31.676     1.893    97.823    95.930


"mpstat -ws" shows:

              Proc0                           Proc4            
              94.56%                          99.92%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  22.43%  24.13%  36.16%  11.84%  30.39%  20.72%  30.08%  18.72%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.72%  24.80%  16.14%  33.35%  48.67%  46.49%   2.35%   2.49%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
    0      47       6       0      47    0.20     5.1
    1      60       0       0      40    0.24     5.9
    2      92       0       0       8    0.38     9.4
    3      14       0       0      86    0.13     3.2
    4      99       0       0       1    0.32     8.1
    5      87       0       0      13    0.19     4.7
    6      89       0       0      11    0.30     7.6
    7      73       0       0      27    0.19     4.6
    8     100       0       0       0    0.26     6.4
    9     100       0       0       0    0.25     6.2
    10     43       0       0      57    0.15     3.8
    11    100       0       0       0    0.34     8.6
    12    100       0       0       0    0.48    12.0
    13    100       0       0       0    0.48    12.0
    14      0       0       0     100    0.02     0.5
    15     11       0       0      89    0.02     0.5
    U       -       -       0       2    0.05     1.3
    -      84       0       0      16    3.95    98.7


However, the above output shows that 2 HTCs (cpu14, cpu15) are almost idle, one (cpu3) has a low workload, and the other 13 HTCs are more or less busy; probably only the top 9 busy HTCs carry the 9 running Oracle sessions.

The two idle HTCs (cpu14, cpu15) in core 4 could also signify that 3 cores are running in smt=4 and one core in smt=2.

Applying this to 15 processes (Oracle sessions), there would be 3 cores running with 4 HTCs and one core having only 2 HTCs; in total, 14 HTCs for 15 Oracle sessions.

Let's test it by:

  exec xpp_test.run_job(p_case => 2, p_job_cnt => 15, p_dur_seconds => 120);

"pprof 100 -r PURR" shows:

               Pname      PID     PPID  BE      TID     PTID ACC_time*  STT_time  STP_time   STP-STT
               =====    =====    ===== ===    =====    ===== =========  ========  ========  ========
     ora_j000_testdb 40108192  7864540  AA 15859877        0    27.389     0.020   100.003    99.983
     ora_j001_testdb 11927686  7864540  AA 40697939        0    27.277     0.023   100.020    99.997
     ora_j002_testdb 43647120  7864540  AA 30867661        0    26.892     0.657   100.003    99.346
     ora_j003_testdb 17760278  7864540  AA 73531567        0    26.695     0.009    99.040    99.031
     ora_j004_testdb  8388634  7864540  AA 60424235        0    26.576     0.003   100.013   100.011
     ora_j005_testdb 33095868  7864540  AA 30801969        0    25.437     0.657    98.956    98.300
     ora_j006_testdb 37290002  7864540  AA 26411087        0    25.278     0.656    97.135    96.478
     ora_j007_testdb 44105940  7864540  AA 39190655        0    24.977     0.003   100.009   100.006
     ora_j008_testdb 45285388  7864540  AA 72482871        0    24.773     0.735   100.013    99.279
     ora_j009_testdb 33488948  7864540  AA 75628593        0    24.262     0.015    96.875    96.860
     ora_j010_testdb 13828318  7864540  AA 67240145        0    24.256     0.016    96.876    96.860
     ora_j011_testdb 42926108  7864540  AA 64421997        0    24.233     0.024    96.874    96.850
     ora_j012_testdb 42336484  7864540  AA 32112833        0    24.180     0.016    96.876    96.860
     ora_j013_testdb 38600812  7864540  AA 47972531        0    24.049     0.734    96.874    96.140
     ora_j014_testdb 46727262  7864540  AA 25690295        0    24.047     0.734    96.874    96.140

"mpstat -s" shows:

              Proc0                           Proc4            
              99.99%                         100.00%           
  cpu0    cpu1    cpu2    cpu3    cpu4    cpu5    cpu6    cpu7 
  21.33%  28.15%  24.69%  25.81%  25.00%  25.03%  24.98%  24.99%
 
              Proc8                          Proc12            
             100.00%                         100.00%           
  cpu8    cpu9    cpu10   cpu11   cpu12   cpu13   cpu14   cpu15
  25.03%  25.02%  24.66%  25.29%  25.92%  24.12%  24.95%  25.02%


"sar -P ALL" shows:

  cpu    %usr    %sys    %wio   %idle   physc   %entc
   0       72       5       0      23    0.21     5.3
   1       94       0       0       6    0.28     7.1
   2       91       0       0       9    0.24     6.1
   3       95       0       0       5    0.26     6.5
   4       99       0       0       1    0.25     6.2
   5      100       0       0       0    0.25     6.3
   6      100       0       0       0    0.25     6.3
   7      100       0       0       0    0.25     6.2
   8      100       0       0       0    0.25     6.3
   9      100       0       0       0    0.25     6.3
   10      99       0       0       1    0.25     6.2
   11     100       0       0       0    0.25     6.3
   12     100       0       0       0    0.25     6.4
   13      98       0       0       2    0.25     6.1
   14     100       0       0       0    0.25     6.2
   15     100       0       0       0    0.25     6.3
   -       97       0       0       3    4.00   100.0


The above outputs deliver no evidence for the non-existence of an smt=3 mode. It is possible that I missed some points here; it will be interesting to see how to demonstrate it.

5. AIX Internal Code


AIX "struct procinfo", used by the getprocs subroutine (/usr/include/procinfo.h), contains a comment on pi_cpu:

  struct  procinfo
  {
    /* scheduler information */
    unsigned long   pi_pri;         /* priority, 0 high, 31 low */
    unsigned long   pi_nice;        /* nice for priority, 0 to 39 */
    unsigned long   pi_cpu;         /* processor usage, 0 to 80 */
    ...
  };
Probably, in the minds of the AIX developers, a process's processor usage is not allowed to exceed 80.

All fields in Oracle procsinfo (not AIX struct "procinfo") can be listed by:

$ nm -Xany $ORACLE_HOME/bin/oracle |grep -i procsinfo

  procsinfo:T153=s1568
  pi_pid:121,0,32;
  pi_ppid:121,32,32;
  ...
  pi_cpu:123,448,32;
  ...
  pi_utime:123,864,32;
  ...
  pi_stime:123,928,32;
  ...

6. vpm_throughput_mode


This AIX scheduler tunable parameter specifies the desired level of SMT exploitation for scaled throughput mode.

A value of 0 gives default behavior (raw throughput mode).
A value of 1, 2, or 4 selects scaled throughput mode and the desired level of SMT exploitation: the number of threads used on one core before the next core is used.

schedo -p -o vpm_throughput_mode=<value>
  0 Legacy Raw mode (default)
  1 Enhanced Raw mode with a higher threshold than legacy
  2 Scaled mode, use primary and secondary SMT threads
  4 Scaled mode, use all four SMT threads


Raw Mode (0, 1)

provides the highest per-thread throughput and best response times, at the expense of activating more physical cores. For example, Legacy Raw mode (the default) dispatches workload to all primary threads before using any secondary threads.

Secondary threads are activated when the load of all primary threads is over a certain utilization (probably 50%) and new workload (a process) arrives to be dispatched.

The 3rd and 4th threads are activated when the load of the secondary threads is over a certain utilization (probably 20%) and new workload arrives to be dispatched.

Scaled Mode (2, 4)

aims at the highest per-core throughput (in the specified mode: 2 or 4), at the expense of per-thread response times and throughput. For example, Scaled mode 2 dispatches workload to both the primary and secondary threads of one core before using those of the next core; Scaled mode 4 dispatches workload to all 4 threads of one core before using those of the next core.

In Scaled mode 2, the 1st and 2nd threads of each core are bound together, so both carry a similar workload (CPU usage). The 3rd and 4th threads are activated when the load of the 1st and 2nd threads is over a certain utilization (probably 30%) and new workload arrives to be dispatched.

Note that this tuning intention is per active core, not across all cores in the LPAR. In fact, it aims at activating fewer cores. It seems a setting conceived for a test system with a few LPARs.

Referring to Table-1, vpm_throughput_mode=2 corresponds to smt=2: two threads running per core, Throughput/HTC = 0.7, CPU% = 43.75. In real applications with Scaled mode 2, we also observed that CPU% is constrained under 43% even when the run queue is shorter than the number of cores. That means that even though the workload is low, CPU% cannot climb to its maximum of 62.50, and applications cannot benefit from the maximum Throughput/HTC. For performance-critical applications, Scaled mode is questionable. By contrast, Raw mode automatically tunes the CPU% based on the workload, which is probably why vpm_throughput_mode defaults to 0.

We can see there is no vpm_throughput_mode=3. This is probably related to the particularity mentioned in Blog [3] about the non-existence of an smt=3 mode.

There is also a naming confusion. By default, POWER7 runs in "Legacy Raw mode", while POWER6 behaves like the "scaled throughput mode". Normally "Legacy" means something used in a previous model or release, but here POWER6 uses something like "Scaled mode", and a later model (POWER7) introduces a "Legacy" mode 0.

7. NMON Report


The NMON report contains three kinds of worksheets on CPU usage: PCPU_ALL (PCPUnnn), SCPU_ALL (SCPUnnn), CPU_ALL (CPUnnn).

AIXpert Blog[10] said:

   If I had to guess then the Utilisation numbers in our PCPU_ALL graph (above) have been scaled from 75 cores to roughly 62 cores so "show" some SMT threads are unused so the CPU cores are not fully used (and given enough threads it could give you more performance). Roughly 10 - 15% more. Now, in my humble opinion, this is totally the wrong way of doing this as it is just plain confusing.

   The PCPU and SCPU stats where (in my humble opinion) a confusing mistake and only useful if you have the CPUs in Power saving mode i.e. its changing the CPU GHz to save electrical power.

and IBM developerWorks Forum[11] described:

   PCPU_ALL is the actual physical resource consumption. It would be in units of cores.
   SCPU_ALL is the scaled physical resource consumption. Differs from PCPU_ALL if running at non-nominal frequency.
          Again in units of cores. SCPU, PCPU do not differ when the system runs in the nominal frequency.
   CPU_ALL: PhysicalCPU tag (0.376) denotes the fraction of core used by this partition.
          The distribution of the 0.376 across various modes (user, sys, wait, idle) is proportional to the CPU_ALL% in all modes. 
         Applying this % would give the PCPU_ALL.

In short, PCPU_ALL represents PURR and SCPU_ALL represents SPURR; CPU_ALL denotes the PCPU_ALL modes (user, sys, wait, idle) as percentages, whose sum should be around 100%.

PCPUnnn represents the CPU% of one single HTC (logical CPU, see Table-1); PCPU_ALL is the sum of all PCPUnnn across the various modes (user, sys, wait, idle).

In case of smt=2 (only two HTCs per core are activated), at each time instance, sum of user and sys in PCPUnnn should be under 43.75%; for each core, sum of user and sys should be under 87.5% (2*43.75); whereas for whole LPAR, sum of user and sys in PCPU_ALL should be under number_of_core * 87.50%.

In the case of smt=4 (all 4 HTCs per core activated), at each instant the sum of user and sys in PCPUnnn should be under 25.00%; for each core, the sum of user and sys should be under 100.00% (4*25.00); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 100.00%.
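These per-HTC and per-core caps follow directly from Table-1:

```python
# Bounds implied by Table-1 for PURR-based utilization (user + sys):
# per-HTC CPU% times the number of active HTCs gives the per-core cap.
per_htc_cpu = {2: 43.75, 4: 25.00}   # smt=2 and smt=4 from Table-1

core_cap = {}
for smt, htc_cap in per_htc_cpu.items():
    core_cap[smt] = htc_cap * smt
    print(f"smt={smt}: per-HTC cap {htc_cap}%, per-core cap {core_cap[smt]}%")
```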

In the TOP worksheet, %CPU, %Usr, %Sys also represent PURR. Note that if Threads > 1, they are the sum over all threads aggregated by PID, and hence can exceed 80%.

In Oracle AWR report, %CPU is in PURR too.

8. Power8


(a). SMT=4: for a single HTC (smt=1), CPU% = 60.00% instead of the 62.50% of POWER7 (see Table-1).
       The throughput ratio of smt=1 vs. smt=4 is 60.00/25.00 = 2.4 instead of 2.5 on POWER7,
       that is, about 4% (2.5/62.5 = 0.1/2.5 = 0.04) less than POWER7.

(b). SMT=8: for a single HTC (smt=1), CPU% = 56.00%.
       The throughput ratio of smt=1 vs. smt=8 is 56.00/12.50 = 4.48.

The above preliminary figures need to be further verified.
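The ratios in (a) and (b) can be recomputed side by side with POWER7 (the POWER8 CPU% figures are this blog's preliminary estimates):

```python
# Recompute the smt=1 vs. full-SMT throughput ratios; the POWER8 CPU%
# figures are the blog's preliminary estimates, POWER7's from Table-1.
cases = {
    "POWER7 SMT=4": (62.50, 25.00),
    "POWER8 SMT=4": (60.00, 25.00),
    "POWER8 SMT=8": (56.00, 12.50),
}
ratios = {}
for name, (cpu_smt1, cpu_full) in cases.items():
    ratios[name] = cpu_smt1 / cpu_full
    print(f"{name}: {cpu_smt1}/{cpu_full} = {ratios[name]:.2f}")
```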

Each POWER8 core has 16 execution pipelines [13]:
 2  fixed-point pipelines
 2  load-store pipelines
 2* load pipelines (no results to store)
 4* double-precision floating-point pipelines, which can also act as eight single-precision pipelines
 2* fully symmetric vector pipelines with support for VMX and VSX AltiVec instructions.
 1  decimal floating-point pipeline
 1* cryptographic pipeline (AES, Galois Counter Mode, SHA-2)
 1  branch execution pipeline
 1  condition register logical pipeline
 
  Note: All units different from POWER7 are marked by "*". 
        POWER7 Core has 12 execution units, POWER8 16.

9. Conclusion


This blog presented the POWER7 model of CPU usage and throughput, and examined it with real cases. Accurate modelling leads not only to fruitful system tuning and trustworthy performance assessment, but also to fair charge-back and economical resource utilization (e.g. Power Saver Mode). In a coming study, we will investigate the applicability, and eventually the adaptation, of the model on the new POWER8 (SMT=8) and the future POWER9.

References


1. POWER7

2. Understanding CPU Utilization on AIX

3. Local, Near & Far Memory part 3 - Scheduling processes to SMT & Virtual Processors

4. P. Mackerras, T. S. Mathews, and R. C. Swanberg. Operating System Exploitation of the POWER5 System.
    IBM J. Res. Dev., 49(4/5):533–539, July 2005.

5. CPU accounting in multi-threaded processors

6. java stored procedure calls and latch: row cache objects, and performance

7. java stored procedure calls and latch: row cache objects

8. Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1)

    Bug 13354348 : UNACCOUNTED GAP BETWEEN ELAPSED TO CPU TIME ON 11.2 IN AIX
    Bug 16044824 - UNACCOUNTED GAP BETWEEN ELAPSED AND CPU TIME FOR DB 11.2 ON PLATFORM AIX POWER7
    Bug 18599013 : NEED TO CALCULATE THE UNACCOUNTED TIME FOR A TRACE FILE
    Bug 7410881 : HOW CPU% UTILIZATION COLLECTED ON AIX VIA EM
    Bug 15925194 : AIX COMPUTING METRICS INCORRECTLY

9. Oracle on AIX - where's my cpu time ?

10. nmon CPU graphs - Why are the PCPU_ALL graphs lower?

11. dW:AIX and UNIX:Performance Tools Forum:Nmon - PCPU_ALL

12. libperfstat.h

13. POWER8

Monday, April 20, 2015

Oracle 11.2.0.4.0 AWR "Tablespace IO Stats" Column Names Shifted

Oracle 11.2.0.4.0 added two new columns in the sections "Tablespace IO Stats" and "File IO Stats":

 1-bk Rds/s
 Av 1-bk Rd(ms)


but in "Tablespace IO Stats", both column names do not match the content of the table.

Running the appended TestCase, we get the AWR report for "Tablespace IO Stats":


Tablespace  Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  1-bk Rds/s  Av 1-bk Rd(ms)  Writes  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS       25         6          0         7.6          87            1.54       0            22             0              0


and "File IO Stats":

Tablespace  Filename      Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  1-bk Rds/s  Av 1-bk Rd(ms)  Writes  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS    test_tbs.dbf     25         6          0         7.6           2               0      87            22             0              0


In "Tablespace IO Stats", the columns "1-bk Rds/s" and "Av 1-bk Rd(ms)" must be swapped with the column "Writes" so that the names match the content:

Tablespace  Reads  Av Rds/s  Av Rd(ms)  Av Blks/Rd  Writes  1-bk Rds/s  Av 1-bk Rd(ms)  Writes avg/s  Buffer Waits  Av Buf Wt(ms)
TEST_TBS       25         6          0         7.6      87        1.54               0            22             0              0


Another difference noticed is that both "Tablespace IO Stats" and "File IO Stats" report Read statistics:
     Av Rd(ms)
  Av Blks/Rd

but there are no symmetrical figures on Write like:
   *Av Wr(ms)
 *Av Blks/Wr


With the following query, we can supply these statistics:

select filename, file#, snap_id
      ,round(phyrds_d)                             "Reads"
      ,round(phyrds_d/interval_seconds)            "Av Reads/s"      
      ,round(readtim_d*10/nullif(phyrds_d, 0))     "Av Rd(ms)"       
      ,round(phyblkrd_d/nullif(phyrds_d, 0))       "Av Blks/Rd"  
      ,round(singleblkrds_d/interval_seconds)      "1-bk Rds/s"  
      ,round(singleblkrdtim_d*10/nullif(singleblkrds_d, 0))     "Av 1-bk Rd(ms)"
      ,round(phywrts_d)                            "Writes"  
      ,round(phywrts_d/interval_seconds)           "Av Writes/s"  
      ,round(writetim_d*10/nullif(phywrts_d, 0))   "*Av Wr(ms)"      -- * Not in AWR
      ,round(phyblkwrt_d/nullif(phywrts_d, 0))     "*Av Blks/Wr"     -- * Not in AWR
      ,round(wait_count_d)                         "Buffer Waits"  
      ,round(time_d*10/nullif(wait_count_d, 0))    "Av Buf Wt(ms)"   -- in CentiSeconds
from (
  select 
     phyrds - lag(phyrds) over(partition by file# order by snap_id) phyrds_d
    ,phywrts - lag(phywrts) over(partition by file# order by snap_id) phywrts_d
    ,singleblkrds - lag(singleblkrds) over(partition by file# order by snap_id) singleblkrds_d
    ,readtim - lag(readtim) over(partition by file# order by snap_id) readtim_d
    ,writetim - lag(writetim) over(partition by file# order by snap_id) writetim_d
    ,singleblkrdtim - lag(singleblkrdtim) over(partition by file# order by snap_id) singleblkrdtim_d
    ,phyblkrd - lag(phyblkrd) over(partition by file# order by snap_id) phyblkrd_d
    ,phyblkwrt - lag(phyblkwrt) over(partition by file# order by snap_id) phyblkwrt_d
    ,wait_count - lag(wait_count) over(partition by file# order by snap_id) wait_count_d   
    ,time - lag(time) over(partition by file# order by snap_id) time_d   
    ,interval_seconds
    ,t.*
from dba_hist_filestatxs t
   ,(select snap_id s_snap_id
           ,((sysdate + (end_interval_time - begin_interval_time)) - sysdate)*86400 interval_seconds 
       from dba_hist_snapshot)
where t.snap_id = s_snap_id
  and tsname = 'TEST_TBS'
  and snap_id >= (select max(snap_id) from dba_hist_snapshot) - 2
);


TestCode


Run code block:

 drop tablespace test_tbs including contents;
 
 create tablespace test_tbs datafile 'test_tbs.dbf' size 100m reuse online;
 
 drop table testt;
 
 create table testt(x number, y varchar2(1000)) tablespace test_tbs;
 
 exec sys.dbms_workload_repository.create_snapshot('ALL'); 
 
 insert into testt select level x, rpad('abc', 1000, 'x') y from dual connect by level <= 1000;
 
 commit;
 
 alter system flush buffer_cache; 
 
 select count(*) from testt;
 
 exec dbms_lock.sleep(3);
 
 exec sys.dbms_workload_repository.create_snapshot('ALL'); 
 
 select bytes, blocks from dba_segments where segment_name = 'TESTT';
And get the AWR report by:

select * from table(SYS.DBMS_WORKLOAD_REPOSITORY.awr_report_html(
  (select dbid from v$database), 1, 
  (select max(snap_id) from dba_hist_snapshot) - 1, 
  (select max(snap_id) from dba_hist_snapshot)));
There could be other misleading information in an AWR report. Recently we were puzzled by high Session UGA and PGA memory reported in AWR, where UGA is much larger than PGA in a dedicated server:
  session pga memory max    308,253,725,520    302,308,652,008
  session uga memory max  5,614,577,384,552  5,666,238,828,552
At the end, we found one MOS Note:
    High Session UGA & PGA Memory Reported in AWR (Doc ID 1483177.1)
which said:
    These statistics will be removed from AWR report in future versions, could be 12.1 
    so you should not depend on these fake numbers for investigations of performance issues.