Monday, April 23, 2012

High system time utilization on fast machine

Problem:


Recently we hit a performance problem on a system running a heavy Oracle application. Occasionally, when more than 100 "end of day" batch jobs are automatically submitted by a job-control system in the early morning, no new users can log in to Oracle (the alert log shows many lines of ORA-609, i.e. could not attach to incoming connection).

Concentrating on Oracle performance alone did not turn up any particular clue: the already logged-in users and applications could still work, and the Oracle workload was not high.

We then asked the UNIX administrator to send us some vmstat snapshots (an AIX NMON report is even better), and it immediately turned out that the run queue, kernel thread context switches, and system CPU utilization were extremely high compared to normal operation.

Here are two vmstat entries, one from a normal period and one from the batch (problem) period.

kthr  faults         cpu        time
----- ------------  ---------  --------
  r       sy     cs  us sy id   hr mi se
 37   250806   7412  34  2 64   09:53:39  << normal time
185    87914  34261  12 80  8   02:52:44  << batch  time
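The sy (system CPU %) column is the tell-tale here. As a small sketch (the field positions assume the trimmed layout above, not a full vmstat line), an awk filter can flag the samples whose system-CPU share crosses a threshold:

```shell
# Flag samples where the "sy" (system CPU %) field exceeds 50.
# Field numbers assume the trimmed report layout above (r, syscalls,
# cs, us, sy, id, time); adjust them for a full vmstat line.
flagged=$(awk 'NR > 2 && $5 + 0 > 50 { print $7 }' <<'EOF'
  r       sy     cs  us sy id   hr mi se
----- ------------  ---------  --------
 37   250806   7412  34  2 64   09:53:39
185    87914  34261  12 80  8   02:52:44
EOF
)
echo "high system-CPU samples at: $flagged"
```

With the two samples above, only the 02:52:44 batch-time line is flagged.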

At the end of this Blog, complete, simplified test code is appended; it can be run on AIX and Solaris.

Analysis:


The application is running on a high-end IBM AIX server with many cores (physical CPUs).
When 100 batch jobs are submitted, most of them can be scheduled into the "Running" state;
however, they all read a single config file, which inherently creates a single point of contention (bottleneck). I/O is far slower than CPU, so the faster the CPUs, the higher the contention.

Here is what the AIX default scheduler (SCHED_OTHER) does. When a "New" process enters the system, it is first put into the "Ready" state. The scheduler then brings it to "Running"; in this case it has to wait for I/O, so it switches to a "Preempted" state, later moves back to "Ready", and eventually waits for the scheduler to bring it into "Running" again, yielding the CPU to other processes in the meantime. Since all these processes have the same characteristics, they each go through many such life cycles. The system therefore ends up spending the majority of its time on kernel thread context switches instead of application work. This "thrashing", caused by competition for the single file, is what drives the high system CPU utilization.

The problem occurs only when the system base load (the load at the moment the batch jobs are launched) reaches a certain level.

As tools, vmstat and sar were used for CPU and filemon for file I/O; on AIX, an nmon report provides much richer information. Using "ps -efl" (or the Berkeley-style "ps glew"), one can see many <defunct> and <exiting> processes (more than 50 of each) during the problem time. This is because batchload forks loadconf into the background and exits before its child processes terminate, so the children become orphans.
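As a quick check (a generic sketch, not AIX-specific), the zombie count can be pulled straight out of ps; the character-class trick keeps grep from matching its own command line:

```shell
# Count <defunct> (zombie) processes; grep -c prints 0 when none exist.
# The [d]efunct pattern stops grep from counting its own ps entry.
zombies=$(ps -efl | grep -c '[d]efunct')
echo "defunct processes: $zombies"
```

The same pattern with `[e]xiting` counts the exiting processes mentioned above.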

When monitoring CPU performance, it is better to check CPU utilization rather than the run queue: CPU utilization is an average over an interval, whereas the run queue (for example in vmstat) is a point-in-time sample, normally taken at clock-tick granularity (on the order of every 10 milliseconds), while today's CPUs tick in nanoseconds. The run-queue figure therefore cannot be very precise; for instance, submitting 100 jobs does not mean the run queue will show 100. In fact, the Nyquist sampling theorem requires sampling at least twice as fast as the signal being measured.

On Solaris, the paper at http://www.solarisinternals.com/wiki/images/2/28/Technocrat-util.pdf
has a profound discussion of run queues and demonstrates how to
track the insertion of runnable threads onto CPU run queues with a DTrace script.
One thing I noticed is that the "system calls" column in vmstat does not necessarily
move in step with the run queue and context switches. That is still not clear to me.
This Blog also shows that finding the root cause is paramount in Oracle performance troubleshooting.

Fix:


A quick fix is to insert a sleep after submitting each job, so that the CPU's speed advantage is throttled and the I/O burden on the single config file is reduced.
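A minimal sketch of the throttled launcher (the 1-second interval and the placeholder job are only assumptions to be tuned per system; the same idea appears as the commented-out sleep in Script-3 below):

```shell
# Throttled submission sketch: pause after each background job so the
# concurrent reads of the single config file do not all pile up at once.
count=0
max=3                # use the real job count (e.g. 100) in production
while [ "$count" -lt "$max" ]
do
  sleep 0 &          # placeholder for the real batch job (e.g. loadconf)
  sleep 1            # throttle interval -- tune per system
  count=$((count + 1))
done
wait                 # reap the background children; avoids <defunct> entries
echo "submitted $max jobs"
```

The final wait also addresses the orphaned-children symptom seen with "ps -efl".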

Test Codes:

  1.  Script-1 creates a config file, which looks like a Java properties file.
  2.  Script-2 reads the config file.
  3.  Script-3 launches N batch jobs in the background.
The test can be performed by calling, in one UNIX session:
      batchload 100
and at the same time running vmstat in a second session.

Script-1 confcreate

#!/bin/ksh
#    usage:  confcreate
count=0
max=1000
configfile="conf"

while [[ $count -lt $max ]]
do
  print "name$count=val$count" >> $configfile
  count=$((count + 1))
done

Script-2 loadconf

#!/bin/ksh
#    usage:  loadconf
count=0
max=1000
configfile="conf"

while [[ $count -lt $max ]]
do
  val=$(awk -F"=" "/^(name$count)=/"'{print $2}' "$configfile")
  count=$((count + 1))
# some SQL*Plus script would do the real work here
done

Script-3 batchload


#!/bin/ksh
#    usage:   . ./batchload 100
count=0
max=$1

while [[ $count -lt $max ]]
do
 ./loadconf &
 print batchload $count at $(date)
#  sleep 2                            # sleep to throttle CPU (the quick fix)
  count=$((count + 1))
done