POWER8 content is added in Section "8. POWER8".
1. POWER7 Execution Units
The POWER7 core is made up of 12 execution units [1]:
2 fixed-point units
2 load/store units
4 double-precision floating-point units
1 vector unit supporting VSX
1 decimal floating-point unit
1 branch unit
1 condition register unit
2. CPU Usage and Throughput
When SMT=4 is set, each core provides 4 Hardware Thread Contexts (HTCs, logical CPUs) and can simultaneously execute 4 software threads (processes, tasks).
For example, if more than 2 threads want to execute floating-point instructions in the same cycle, the 3rd and 4th threads have to wait cycles until one of the two FP units is free. Therefore, with SMT=4, the number of instructions executed by a single HTC slows down, but overall throughput per core goes up. IBM claims a 60% throughput boost: when 4 processes run on a core (smt=4), it delivers 1.6 times the throughput of a single process per core ([2], [3]). In case of smt=2, the boost is 40%, or 1.4 times the throughput (note that we use lower case "smt=4" for the number of busy hardware threads, to distinguish it from the POWER SMT configuration of SMT=4).
Mathematically, with smt=4, one could think that 25% core usage provides 40% CPU power. The response time increases from 1 to 2.5 (= 1/0.4) instead of 4 (= 1/0.25).
Now comes the puzzle: how much CPU usage should be shown for each HTC and each process? 25% or 40% in the above example? In general, measuring and modelling SMT CPU usage is an ongoing research subject ([5]).
POWER7 advances with a new model of CPU usage. The general intent of POWER7 is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput [2].
In smt=4, with 4 threads running per core, each thread shares 25% of a whole core and provides 40% of the smt=1 throughput. To build up the linear relation of throughput to CPU usage, the CPU usage for smt=1, 2, 3, 4 can be computed as:
CPU%(smt=1) = (1.0/0.4) * 25% = 62.50%
CPU%(smt=2) = (0.7/0.4) * 25% = 43.75%
CPU%(smt=3) = (0.5/0.4) * 25% = 31.25%
CPU%(smt=4) = (0.4/0.4) * 25% = 25.00%
Note that for smt=3, the boost of 50%, or 1.5 times throughput, stems from this blog's own tests and can be inaccurate.
Expressed as a linear equation:
t = k * u
where t stands for throughput and u for CPU usage. The CPU% values above were calibrated so that t/u is the same for every smt level (1.0/0.6250 = 0.7/0.4375 = 0.5/0.3125 = 0.4/0.2500 = 1.6), so the constant k = 1.6 and the relation is linear and independent of smt.
Putting all together, we can draw a table:
smt | Throughput/core | Throughput/HTC | CPU% |
1 | 1.0 | 1.0 | 62.50 |
2 | 1.4 | 0.7 | 43.75 |
3 | 1.5 | 0.5 | 31.25 |
4 | 1.6 | 0.4 | 25.00 |
Table-1
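The calibration behind Table-1 can be captured in a few lines of C (a sketch of this blog's model; the per-HTC throughput factors 1.0/0.7/0.5/0.4 are the measured values from Table-1, not an official IBM formula, and the function name is ours):

```c
#include <assert.h>

/* Per-HTC throughput relative to smt=1, taken from Table-1
   (index = smt level; index 0 unused). */
static const double tp_per_htc[5] = { 0.0, 1.0, 0.7, 0.5, 0.4 };

/* Calibrated CPU% of one HTC: scale the 25% even core share so that
   CPU% stays proportional to per-HTC throughput (smt=4 is the anchor). */
double cpu_pct(int smt)
{
    return tp_per_htc[smt] / tp_per_htc[4] * 25.0;
}
```

For example, cpu_pct(1) yields 62.50 and cpu_pct(4) yields 25.00, reproducing the CPU% column of Table-1.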
Therefore, the maximum CPU usage of an HTC (logical CPU), and hence of a software thread (process or task), is 62.5%. In POWER7 SMT=4, it would be astonishing to observe a process CPU usage above 65%, or an HTC CPU usage above 65% (mpstat -s).
Picking performance test data out of Blog [6] Table-1 (tested on POWER7, 4 cores, SMT=4, Oracle 11.2.0.3.0), and verifying it against the above linear relations:
JOB_CNT | HTC/Core | C2_RUN_CNT | Throughput/HTC | Throughput_Based_CPU% | Throughput_Ratio_to_Min | Theory_Throughput/HTC |
1 | 1 | 119 | 119 (119/1) | 64.67 | 2.59 (119/46) | 115.00 |
8 | 2 | 580 | 73 (580/8) | 39.40 | 1.58 (73/46) | 80.50 |
12 | 3 | 654 | 55 (654/12) | 29.89 | 1.20 (55/46) | 57.50 |
16 | 4 | 730 | 46 (730/16) | 25.00 | 1.00 (46/46) | 46.00 |
Table-2
where Throughput_Based_CPU%:
(119/46)*25% = 64.67%
(72.5/46)*25% = 39.40% (using the unrounded 580/8 = 72.5)
(55/46)*25% = 29.89%
and Theory_Throughput/HTC based on linear interpolation:
46*(0.6250/0.25) = 115.00
46*(0.4375/0.25) = 80.50
46*(0.3125/0.25) = 57.50
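The two derived columns of Table-2 can be recomputed with a small C sketch (the baseline of 46 transactions per HTC at smt=4 comes from the test data above; the function names are ours):

```c
#include <assert.h>

/* Throughput-based CPU% (Table-2): measured per-HTC throughput scaled
   against the smt=4 baseline, anchored at the 25% core share of one HTC. */
double throughput_based_cpu_pct(double tp_htc, double tp_htc_smt4)
{
    return tp_htc / tp_htc_smt4 * 25.0;
}

/* Theory_Throughput/HTC: linear interpolation from the calibrated CPU%. */
double theory_tp_htc(double cpu_pct, double tp_htc_smt4)
{
    return tp_htc_smt4 * (cpu_pct / 25.0);
}
```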
Table-2 shows that the Theory_Throughput is close to the tested throughput. Thus the designed CPU usage is a calibrated, scalable metric (notice: the "S" in POWER8 server names signifies Scale-out).
Transaction-heavy applications are usually benchmarked in terms of throughput, hence binding throughput linearly to CPU usage is a practical approach to assess application performance.
In principle, CPU usage represents the throughput, and its complement (1 - usage) stands for the remaining available capacity. One process running in one core with a CPU usage of 62.5% on the first HTC means that there is still 37.5% of capacity available on the other 3 HTCs, each of which can take a share of 12.5%.
In practice, CPU utilization can be applied as a metric for charging back computing resources used, and its complement can be used for prediction in capacity planning.
This model of SMT CPU accounting is not widely acknowledged, and has therefore caused confusion. For example, Oracle Note 1963791.1:
Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1) [8]
where session trace records:
cpu time = 86.86 waited time = 7.06 elapsed time = 142.64
and the difference:
142.64 - (86.86 + 7.06) = 48.72 seconds,
is interpreted as "Unaccounted Time".
In fact,
86.86/142.64 = 60.90%,
indicates that a single Oracle session alone occupies almost one full core (close to the 62.5% maximum).
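The trace arithmetic is simple enough to verify mechanically (a sketch; the three input figures come from the Oracle note quoted above, and the function names are ours):

```c
#include <assert.h>

/* "Unaccounted" gap of a 10046 trace: elapsed minus (cpu + waited). */
double unaccounted(double cpu, double waited, double elapsed)
{
    return elapsed - (cpu + waited);
}

/* PURR-model core share of the session: cpu time over elapsed time;
   62.5% would mean one full POWER7 core in SMT=4. */
double core_share_pct(double cpu, double elapsed)
{
    return 100.0 * cpu / elapsed;
}
```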
Blog [9] also reported a similar observation on AIX POWER7 and tried to explain the unaccounted time.
People working on other UNIX systems (Solaris, HP-UX, Linux) are probably used to the intuitive interpretation of CPU time and elapsed time, but with the advance of multi-threaded processors under AIX, rethinking this interpretation would help disperse the confusion so that CPU resources can be efficiently allocated and accurately assessed.
3. POWER PURR
According to [2][4],
POWER5 includes a per-thread processor utilization of resources register (PURR), which increments at the timebase frequency multiplied by the fraction of cycles on which the thread can dispatch instructions.
Beginning with the IBM® POWER5™ processor architecture, a new register, PURR, is introduced to assist in computing the utilization.
PURR stands for Processor Utilization of Resources Register, and it is available per Hardware Thread Context.
The PURR counts in proportion to the real time clock (timebase).
The SPURR stands for Scaled Processor Utilization of Resources Register.
The SPURR is similar to PURR except that it increments proportionally to the processor core frequency.
The AIX® lparstat, sar & mpstat utilities are modified to report the PURR-SPURR ratio via a new column, named "nsp".
and it demonstrates the enhanced commands: time (timex), sar -P ALL, mpstat -s, lparstat -E.
AIX provides the command pprof with the flag -r PURR to report CPU usage in PURR time instead of timebase.
For example, start one CPU intensive Oracle session (process) in one core for a duration of 120 seconds:
exec xpp_test.run_job(p_case => 2, p_job_cnt => 1, p_dur_seconds => 120);
(see TestCase in Blog [7], POWER7, 4 Core, SMT=4)
In this case, a single core runs in smt=1. Then track its PURR time for 100 seconds by:
pprof 100 -r PURR
and displaying the report by:
head -n 50 pprof.cpu
The output shows (irrelevant lines removed):
Pprof CPU Report
E = Exec'd F = Forked
X = Exited A = Alive (when traced started or stopped)
C = Thread Created
* = Purr based values
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j000_testdb 42598406 7864540 AA 21299317 0 62.930 0.037 99.805 99.768
If tracking with TimeBase by:
pprof 100
The output (head -n 50 pprof.cpu) looks like:
Pname PID PPID BE TID PTID ACC_time STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ======== ======== ======== ========
ora_j000_testdb 1835064 0 AA 2687059 0 99.899 0.016 99.916 99.900
Continue our example by starting 8 CPU-intensive Oracle sessions (each core runs in smt=2):
exec xpp_test.run_job(p_case => 2, p_job_cnt => 8, p_dur_seconds => 120);
and look at the PURR report for one Oracle process:
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j007_testdb 17760298 7864540 AA 57475195 0 42.910 0.340 99.210 98.870
Then start 12 CPU-intensive Oracle sessions (each core runs in smt=3):
exec xpp_test.run_job(p_case => 2, p_job_cnt => 12, p_dur_seconds => 120);
and look at the PURR report for one Oracle process:
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j007_testdb 33095898 7864540 AA 50135123 0 30.658 0.017 100.008 99.990
And finally, start 16 CPU-intensive Oracle sessions (each core runs in smt=4):
exec xpp_test.run_job(p_case => 2, p_job_cnt => 16, p_dur_seconds => 120);
and look at the PURR report for one Oracle process:
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j014_testdb 33488964 7864540 AA 73531621 0 24.673 0.143 99.145 99.002
We can see that ACC_time* correlates well with the CPU% of Table-1. The small difference is probably due to a single point of contention on the Oracle latch: row cache objects - child: dc_users [7].
The study [4] shows that PURR limits its inaccuracy to 26% in the single-core POWER5 configuration (CMP+SMT).
Additionally, in AIX, the number of physical processors consumed is reported by sar (physc) and vmstat (pc), and the percentage of entitled capacity consumed by sar (%entc) and vmstat (ec).
By the way, Linux on POWER machines reads PURR at regular intervals and makes the values available through a file in procfs [4].
3.1 PURR APIs
AIX perfstat_cpu APIs include:
perfstat_cpu_total(), perfstat_partition_total(), perfstat_cpu_total_wpar(), and perfstat_cpu_util().
The perfstat_partition_total subroutine retrieves global Micro-Partitioning® usage statistics into:
perfstat_partition_total_t lparstats;
In the libperfstat.h file [12], perfstat_partition_total_t contains the following PURR members:
u_longlong_t purr_counter Number of PURR cycles spent in user and kernel mode.
u_longlong_t puser Raw number of physical processor ticks in user mode.
u_longlong_t psys Raw number of physical processor ticks in system mode.
u_longlong_t pidle Raw number of physical processor ticks idle.
u_longlong_t pwait Raw number of physical processor ticks waiting for I/O.
perfstat_cpu_total Subroutine fills perfstat_cpu_total_t, which contains the following members:
u_longlong_t user Raw total number of clock ticks spent in user mode.
u_longlong_t sys Raw total number of clock ticks spent in system mode.
u_longlong_t idle Raw total number of clock ticks spent idle.
u_longlong_t wait Raw total number of clock ticks spent waiting for I/O.
(see file: libperfstat.h)
The perfstat_partition_total interface documentation demonstrates emulating the lparstat command, including PCPU:
/* calculate physical processor ticks during the last interval in user, system, idle and wait mode */
delta_pcpu_user = lparstats.puser - last_pcpu_user;
delta_pcpu_sys = lparstats.psys - last_pcpu_sys;
delta_pcpu_idle = lparstats.pidle - last_pcpu_idle;
delta_pcpu_wait = lparstats.pwait - last_pcpu_wait;
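Continuing the IBM sample, the interval deltas can be turned into the percentages that lparstat prints. A minimal self-contained sketch; the struct below only mirrors the four PURR tick members of perfstat_partition_total_t (on AIX you would use the real type from libperfstat.h):

```c
#include <assert.h>

typedef unsigned long long u_longlong_t;

/* Stand-in mirroring the PURR tick counters of perfstat_partition_total_t. */
struct purr_ticks {
    u_longlong_t puser, psys, pidle, pwait;
};

/* %user over an interval: delta of user PURR ticks divided by the delta
   of all PURR ticks, as lparstat computes it. */
double pcpu_user_pct(struct purr_ticks last, struct purr_ticks now)
{
    u_longlong_t du = now.puser - last.puser;
    u_longlong_t dt = du + (now.psys  - last.psys)
                         + (now.pidle - last.pidle)
                         + (now.pwait - last.pwait);
    return 100.0 * (double)du / (double)dt;
}
```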
Oracle's link to AIX perfstat_cpu_total is visible by:
$ nm -Xany $ORACLE_HOME/bin/oracle |grep perfstat_cpu_total
Name Type Value Size
------------------- ---- ---------- ----
.perfstat_cpu_total T 4580969124
.perfstat_cpu_total t 4580969124 44
perfstat_cpu_total U -
perfstat_cpu_total d 4861786384 8
Type:
T Global text symbol
t Local text symbol
U Undefined symbol
d Local data symbol
but it is not clear how Oracle calls perfstat_partition_total.
4. Thread Scheduling and Throughput
Picking performance test data out of Blog [6] Table-3, and adding the smt per core, we get:
JOB_CNT | C1_RUN_CNT(Throughput) | MIN | MAX | Core_1_smt | Core_2_smt | Core_3_smt | Core_4_smt | Theory_Throughput |
1 | 118 | 118 | 118 | 1 | 0 | 0 | 0 | 118.00 |
2 | 240 | 120 | 120 | 1 | 1 | 0 | 0 | 236.00 |
3 | 360 | 120 | 120 | 1 | 1 | 1 | 0 | 354.00 |
4 | 461 | 109 | 120 | 1 | 1 | 1 | 1 | 472.00 |
5 | 476 | 74 | 119 | 2 | 1 | 1 | 1 | 519.20 |
6 | 515 | 75 | 97 | 2 | 2 | 1 | 1 | 566.40 |
7 | 551 | 66 | 100 | 2 | 2 | 2 | 1 | 613.60 |
8 | 569 | 63 | 77 | 2 | 2 | 2 | 2 | 660.80 |
9 | 597 | 58 | 76 | 3 | 2 | 2 | 2 | 672.60 |
10 | 601 | 56 | 73 | 3 | 3 | 2 | 2 | 684.40 |
11 | 613 | 48 | 67 | 3 | 3 | 3 | 2 | 696.20 |
12 | 646 | 49 | 64 | 3 | 3 | 3 | 3 | 708.00 |
13 | 683 | 47 | 66 | 4 | 3 | 3 | 3 | 719.80 |
14 | 696 | 46 | 65 | 4 | 4 | 3 | 3 | 731.60 |
15 | 714 | 45 | 51 | 4 | 4 | 4 | 3 | 743.40 |
16 | 733 | 44 | 47 | 4 | 4 | 4 | 4 | 755.20 |
Table-3
where Theory_Throughput is calculated based on the above Table-1 and the smt per core, for example:
JOB_CNT=5, 118*1.4*1+118*3 = 519.2
JOB_CNT=11, 118*1.5*3+118*1.4*1 = 696.2
JOB_CNT=15, 118*1.6*3+118*1.5*1 = 743.4
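The Theory_Throughput column can be generated from the per-core smt columns with a short C sketch (the boost factors come from Table-1 and the 118-transaction single-thread baseline from the test; smt=0 marks an idle core; the function name is ours):

```c
#include <assert.h>

/* Per-core throughput boost factors from Table-1 (index = smt level). */
static const double boost[5] = { 0.0, 1.0, 1.4, 1.5, 1.6 };

/* Theory_Throughput of Table-3: single-thread baseline scaled by each
   core's boost factor, summed over all cores. */
double theory_throughput(const int smt_per_core[], int ncores, double base)
{
    double t = 0.0;
    for (int i = 0; i < ncores; i++)
        t += base * boost[smt_per_core[i]];
    return t;
}
```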
Blog [3] reveals a particularity about the non-existence of an smt=3 mode. It says that when starting 9 processes (Oracle sessions) on a POWER7 with 4 cores and SMT=4, there will be 1 core running with 4 HTCs, 1 core with only 1 HTC, and 2 cores with 2 HTCs each; in total, 9 HTCs for 9 Oracle sessions.
As we tested by running:
exec xpp_test.run_job(p_case => 2, p_job_cnt => 9, p_dur_seconds => 120);
"pprof 100 -r PURR" shows:
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j000_testdb 43646996 7864540 AA 65273961 0 43.578 0.008 97.823 97.815
ora_j001_testdb 14090250 7864540 AA 45744377 0 43.314 1.621 100.010 98.389
ora_j002_testdb 38338754 7864540 AA 28442745 0 40.696 0.007 100.000 99.993
ora_j003_testdb 33095926 7864540 AA 78119153 0 36.575 0.010 99.922 99.912
ora_j004_testdb 39583756 7864540 AA 35258545 0 36.204 2.242 97.824 95.582
ora_j005_testdb 42401958 7864540 AA 73662611 0 36.020 1.131 99.731 98.600
ora_j006_testdb 12976182 7864540 AA 68681805 0 35.944 2.212 99.912 97.700
ora_j007_testdb 49086646 7864540 AA 75563151 0 32.372 1.893 97.823 95.930
ora_j008_testdb 7602206 7864540 AA 32112699 0 31.676 1.893 97.823 95.930
"mpstat -ws" shows:
Proc0 Proc4
94.56% 99.92%
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
22.43% 24.13% 36.16% 11.84% 30.39% 20.72% 30.08% 18.72%
Proc8 Proc12
100.00% 100.00%
cpu8 cpu9 cpu10 cpu11 cpu12 cpu13 cpu14 cpu15
25.72% 24.80% 16.14% 33.35% 48.67% 46.49% 2.35% 2.49%
"sar -P ALL" shows:
cpu %usr %sys %wio %idle physc %entc
0 47 6 0 47 0.20 5.1
1 60 0 0 40 0.24 5.9
2 92 0 0 8 0.38 9.4
3 14 0 0 86 0.13 3.2
4 99 0 0 1 0.32 8.1
5 87 0 0 13 0.19 4.7
6 89 0 0 11 0.30 7.6
7 73 0 0 27 0.19 4.6
8 100 0 0 0 0.26 6.4
9 100 0 0 0 0.25 6.2
10 43 0 0 57 0.15 3.8
11 100 0 0 0 0.34 8.6
12 100 0 0 0 0.48 12.0
13 100 0 0 0 0.48 12.0
14 0 0 0 100 0.02 0.5
15 11 0 0 89 0.02 0.5
U - - 0 2 0.05 1.3
- 84 0 0 16 3.95 98.7
However, the above outputs show that 2 HTCs (cpu14, cpu15) are almost idle, one (cpu3) has a low workload, and the other 13 HTCs are more or less busy; probably only the top 9 busy HTCs carry the 9 running Oracle sessions.
The two idle HTCs (cpu14, cpu15) in core 4 could also signify that 3 cores run in smt=4 and one core in smt=2.
Applying the same rule to 15 processes (Oracle sessions), there would be 3 cores running with 4 HTCs and one core with only 2 HTCs; in total, 14 HTCs for 15 Oracle sessions.
Let's test it by:
exec xpp_test.run_job(p_case => 2, p_job_cnt => 15, p_dur_seconds => 120);
"pprof 100 -r PURR" shows:
Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j000_testdb 40108192 7864540 AA 15859877 0 27.389 0.020 100.003 99.983
ora_j001_testdb 11927686 7864540 AA 40697939 0 27.277 0.023 100.020 99.997
ora_j002_testdb 43647120 7864540 AA 30867661 0 26.892 0.657 100.003 99.346
ora_j003_testdb 17760278 7864540 AA 73531567 0 26.695 0.009 99.040 99.031
ora_j004_testdb 8388634 7864540 AA 60424235 0 26.576 0.003 100.013 100.011
ora_j005_testdb 33095868 7864540 AA 30801969 0 25.437 0.657 98.956 98.300
ora_j006_testdb 37290002 7864540 AA 26411087 0 25.278 0.656 97.135 96.478
ora_j007_testdb 44105940 7864540 AA 39190655 0 24.977 0.003 100.009 100.006
ora_j008_testdb 45285388 7864540 AA 72482871 0 24.773 0.735 100.013 99.279
ora_j009_testdb 33488948 7864540 AA 75628593 0 24.262 0.015 96.875 96.860
ora_j010_testdb 13828318 7864540 AA 67240145 0 24.256 0.016 96.876 96.860
ora_j011_testdb 42926108 7864540 AA 64421997 0 24.233 0.024 96.874 96.850
ora_j012_testdb 42336484 7864540 AA 32112833 0 24.180 0.016 96.876 96.860
ora_j013_testdb 38600812 7864540 AA 47972531 0 24.049 0.734 96.874 96.140
ora_j014_testdb 46727262 7864540 AA 25690295 0 24.047 0.734 96.874 96.140
"mpstat -s" shows:
Proc0 Proc4
99.99% 100.00%
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
21.33% 28.15% 24.69% 25.81% 25.00% 25.03% 24.98% 24.99%
Proc8 Proc12
100.00% 100.00%
cpu8 cpu9 cpu10 cpu11 cpu12 cpu13 cpu14 cpu15
25.03% 25.02% 24.66% 25.29% 25.92% 24.12% 24.95% 25.02%
"sar -P ALL" shows:
cpu %usr %sys %wio %idle physc %entc
0 72 5 0 23 0.21 5.3
1 94 0 0 6 0.28 7.1
2 91 0 0 9 0.24 6.1
3 95 0 0 5 0.26 6.5
4 99 0 0 1 0.25 6.2
5 100 0 0 0 0.25 6.3
6 100 0 0 0 0.25 6.3
7 100 0 0 0 0.25 6.2
8 100 0 0 0 0.25 6.3
9 100 0 0 0 0.25 6.3
10 99 0 0 1 0.25 6.2
11 100 0 0 0 0.25 6.3
12 100 0 0 0 0.25 6.4
13 98 0 0 2 0.25 6.1
14 100 0 0 0 0.25 6.2
15 100 0 0 0 0.25 6.3
- 97 0 0 3 4.00 100.0
The above outputs deliver no evidence for the non-existence of an smt=3 mode. It is possible that I missed some points here; it would be interesting to see how to demonstrate it.
5. AIX Internal Code
The AIX "struct procinfo" used by the getprocs subroutine (/usr/include/procinfo.h) contains a comment on pi_cpu:
struct procinfo
{
/* scheduler information */
unsigned long pi_pri; /* priority, 0 high, 31 low */
unsigned long pi_nice; /* nice for priority, 0 to 39 */
unsigned long pi_cpu; /* processor usage, 0 to 80 */
Probably, in the mind of the AIX developers, process processor usage is not allowed to go over 80. In fact, on one POWER8 with SMT=2, a maximum CPU utilization of 76% was observed.
All fields in Oracle procsinfo (not AIX struct "procinfo") can be listed by:
$ nm -Xany $ORACLE_HOME/bin/oracle |grep -i procsinfo
procsinfo:T153=s1568
pi_pid:121,0,32;
pi_ppid:121,32,32;
...
pi_cpu:123,448,32;
...
pi_utime:123,864,32;
...
pi_stime:123,928,32;
...
6. vpm_throughput_mode
This AIX scheduler tunable parameter specifies the desired level of SMT exploitation for scaled throughput mode.
A value of 0 gives default behavior (raw throughput mode).
A value of 1, 2, or 4 selects the scaled throughput mode and the desired level of SMT exploitation: the number of threads used on one core before the next core is used.
schedo -p -o vpm_throughput_mode=
0 Legacy Raw mode (default)
1 Enhanced Raw mode with a higher threshold than legacy
2 Scaled mode, use primary and secondary SMT threads
4 Scaled mode, use all four SMT threads
Raw Mode (0, 1)
provides the highest per-thread throughput and best response times at the expense of activating more physical cores. For example, Legacy Raw mode (the default) dispatches workload to all primary threads before using any secondary threads.
Secondary threads are activated when the load of all primary threads is over a certain utilization, probably 50%, and a new workload (process) arrives to be dispatched.
The 3rd and 4th threads are activated when the load of the secondary threads is over a certain utilization, probably 20%, and a new workload (process) arrives to be dispatched.
Scaled Mode (2, 4)
intends the highest per-core throughput (in the specified mode: 2 or 4) at the expense of per-thread response time and throughput. For example, Scaled mode 2 dispatches workload to both the primary and secondary threads of one core before using those of the next core; Scaled mode 4 dispatches workload to all 4 threads of one core before using those of the next core.
In Scaled mode 2, the 1st and 2nd threads of each core are bound together, thus both carry a similar workload (CPU usage). The 3rd and 4th threads are activated when the load of the 1st and 2nd threads is over a certain utilization, probably 30%, and a new workload (process) arrives to be dispatched.
Note that this tuning intention is per active core, not all cores in the LPAR. In fact, it is aimed at activating fewer cores. It seems a setting conceived for a test system with a few LPARs.
Referring to Table-1, vpm_throughput_mode = 2 corresponds to smt = 2: two threads run per core, Throughput/HTC = 0.7, CPU% = 43.75. In real applications with Scaled mode 2, we also observed that CPU% is constrained under 43% even when the run queue is shorter than the number of cores. That means that even when the workload is low, CPU% cannot climb to its maximum of 62.50, and applications cannot benefit from the maximum Throughput/HTC. For performance-critical applications, Scaled mode is questionable. By contrast, Raw mode automatically tunes the CPU% based on the workload; that is probably why vpm_throughput_mode is set to 0 by default.
We can also see that there is no vpm_throughput_mode=3. Probably this is related to the particularity mentioned in Blog [3] about the non-existence of an smt=3 mode.
There is also a naming confusion. By default POWER7 runs in "Legacy Raw mode", while POWER6 behaves like the "scaled throughput mode". Normally "Legacy" means something used in a previous model or release, but here POWER6 uses something like Scaled mode, and the later model (POWER7) introduced a "Legacy" mode 0.
7. NMON Report
The NMON report contains three groups of worksheets on CPU usage: PCPU_ALL (PCPUnnn), SCPU_ALL (SCPUnnn), and CPU_ALL (CPUnnn).
AIXpert Blog[10] said:
If I had to guess then the Utilisation numbers in our PCPU_ALL graph (above) have been scaled from 75 cores to roughly 62 cores so "show" some SMT threads are unused so the CPU cores are not fully used (and given enough threads it could give you more performance). Roughly 10 - 15% more. Now, in my humble opinion, this is totally the wrong way of doing this as it is just plain confusing.
The PCPU and SCPU stats were (in my humble opinion) a confusing mistake and only useful if you have the CPUs in Power saving mode i.e. its changing the CPU GHz to save electrical power.
and IBM developerWorks Forum[11] described:
PCPU_ALL is the actual physical resource consumption. It would be in units of cores.
SCPU_ALL is the scaled physical resource consumption. Differs from PCPU_ALL if running at non-nominal frequency.
Again in units of cores. SCPU, PCPU do not differ when the system runs in the nominal frequency.
CPU_ALL: PhysicalCPU tag (0.376) denotes the fraction of core used by this partition.
The distribution of the 0.376 across various modes (user, sys, wait, idle) is proportional to the CPU_ALL% in all modes.
Applying this % would give the PCPU_ALL.
In short, PCPU_ALL represents PURR and SCPU_ALL represents SPURR; CPU_ALL denotes the PCPU_ALL modes (user, sys, wait, idle) in percentage, and their sum should be around 100%.
PCPUnnn represents the CPU% of one single HTC (logical CPU, see Table-1); PCPU_ALL is the sum of all PCPUnnn across the various modes (user, sys, wait, idle).
In case of smt=2 (only two HTCs per core are activated), at each time instant the sum of user and sys in PCPUnnn should be under 43.75%; for each core, the sum of user and sys should be under 87.5% (2*43.75); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 87.50%.
In case of smt=4 (all 4 HTCs per core are activated), at each time instant the sum of user and sys in PCPUnnn should be under 25.00%; for each core, the sum of user and sys should be under 100.00% (4*25.00); and for the whole LPAR, the sum of user and sys in PCPU_ALL should be under number_of_cores * 100.00%.
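These bounds can be expressed as small helper functions (a sketch built on the Table-1 CPU% figures, assuming all cores run at the same smt level; the function names are ours):

```c
#include <assert.h>

/* Calibrated per-HTC cap on user+sys PURR%, from Table-1 (index = smt). */
static const double htc_cap[5] = { 0.0, 62.50, 43.75, 31.25, 25.00 };

/* Cap for one core: all active HTCs at their calibrated maximum. */
double core_cap(int smt) { return smt * htc_cap[smt]; }

/* Cap for a whole LPAR of ncores cores at the same smt level. */
double lpar_cap(int smt, int ncores) { return ncores * core_cap(smt); }
```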
In the TOP worksheet, %CPU, %Usr, and %Sys also represent PURR time. Note that if Threads > 1, they are the sum over all threads aggregated by PID, hence can be more than 80%.
In Oracle AWR report, %CPU is in PURR too.
8. POWER8
(a). SMT=4: for a single HTC (smt=1), CPU% = 60.00% instead of the 62.50% of POWER7 (see Table-1).
The throughput ratio of smt=1 vs. smt=4 is 60.00/25.00 = 2.4 instead of 2.5 on POWER7,
that is, about 4% (2.5/62.5 = 0.1/2.5 = 0.04) less than POWER7.
(b). SMT=8: for a single HTC (smt=1), CPU% = 56.00%.
The throughput ratio of smt=1 vs. smt=8 is 56.00/12.50 = 4.48.
The above preliminary figures need to be further verified.
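These preliminary ratios follow from the same calibration idea as Table-1; a tiny C sketch (the 60.00% and 56.00% figures are this blog's own preliminary POWER8 measurements, and the function name is ours):

```c
#include <assert.h>

/* smt=1 vs smt=N throughput ratio: calibrated CPU% of a lone thread
   divided by the even core share 100/N at full smt=N. */
double tp_ratio(double cpu_pct_smt1, int smt)
{
    return cpu_pct_smt1 / (100.0 / smt);
}
```

For example, tp_ratio(60.0, 4) gives the POWER8 SMT=4 ratio of 2.4, while tp_ratio(62.5, 4) reproduces the POWER7 value of 2.5.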
Each POWER8 core has 16 execution pipelines [13]:
2 fixed-point pipelines
2 load/store pipelines
2* load pipelines (no results to store)
4* double-precision floating-point pipelines, which can also act as eight single-precision pipelines
2* fully symmetric vector pipelines with support for VMX and VSX AltiVec instructions
1 decimal floating-point pipeline
1* cryptographic pipeline (AES, Galois Counter Mode, SHA-2)
1 branch execution pipeline
1 condition register logical pipeline
Note: all units different from POWER7 are marked with "*". The POWER7 core has 12 execution units; the POWER8 core has 16.
9. Conclusion
This blog presented the POWER7 model of CPU usage and throughput and examined it against real test cases. Accurate modelling leads not only to fruitful system tuning and trustworthy performance assessment, but also to fair charge-back and economical resource utilization (e.g. Power Saver Mode). As a coming study, we will investigate the applicability, and eventually the adaptation, of the model on the new POWER8 (SMT=8) and the future POWER9.
References
1. POWER7
2. Understanding CPU Utilization on AIX
3. Local, Near & Far Memory part 3 - Scheduling processes to SMT & Virtual Processors
4. P. Mackerras, T. S. Mathews, and R. C. Swanberg. Operating System Exploitation of the POWER5 System.
IBM J. Res. Dev., 49(4/5):533–539, July 2005.
5. CPU accounting in multi-threaded processors
6. java stored procedure calls and latch: row cache objects, and performance
7. java stored procedure calls and latch: row cache objects
8. Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc ID 1963791.1)
Bug 13354348 : UNACCOUNTED GAP BETWEEN ELAPSED TO CPU TIME ON 11.2 IN AIX
Bug 16044824 - UNACCOUNTED GAP BETWEEN ELAPSED AND CPU TIME FOR DB 11.2 ON PLATFORM AIX POWER7
Bug 18599013 : NEED TO CALCULATE THE UNACCOUNTED TIME FOR A TRACE FILE
Bug 7410881 : HOW CPU% UTILIZATION COLLECTED ON AIX VIA EM
Bug 15925194 : AIX COMPUTING METRICS INCORRECTLY
9. Oracle on AIX - where's my cpu time ?
10. nmon CPU graphs - Why are the PCPU_ALL graphs lower?
11. dW:AIX and UNIX:Performance Tools Forum:Nmon - PCPU_ALL
12. libperfstat.h
13. POWER8