On Oracle: High system time utilization on fast machine

Problem:

Recently we hit some performance problem on a system running a heavy Oracle application. Occasionally when more than 100 "end of day" batch jobs are automatically submitted by a Job control system in the early morning, no more new users can login into Oracle (alert log shows many lines of ORA-609: that is, could not attach to incoming connection).

By only concentrating on Oracle performance, it could not figure out any particular clue.
The already logged in users and applications can still work, Oracle workload is not high.

Then we asked UNIX administrator to send us some snapshots of vmstat(AIX NMON report even better), immediately it turns out that Runqueues, Kernel thread context switches, system cpu utilization are extremely high in comparing to normal operation.

Here are two entries of vmstat, one is in normal time, another is during batch (problem) time.

kthr faults cpu time

----- ------------ --------- --------

r sy cs us sy id hr mi se

37 250806 7412 34 2 64 09:53:39 << normal time

185 87914 34261 12 80 8 02:52:44 << batch time

At the end of this Blog, a complete simplified test codes are appended and can be tested on AIX and Solaris.

Analysis:

The application is running on high end IBM AIX server with many cores(Physical CPUs).
When 100 batch jobs are submitted, most of them can be scheduled to "Running" state,
however, there is a single config file to be read, hence inherently creating a single point of contention (bottleneck). IO is normal not fast as CPU. The faster the CPU, the higher contention.

Here is what AIX default Scheduler (SCHED_OTHER) would do. When a "New" process is inserted into system, it is first put into "Ready". Then scheduler brings it to "Running" state, in this case, it has to wait IO, thus switches to "Preempted" state, and later back again to "Ready" state, eventually waits scheduler to bring into "Running" state again. Thus it yields CPU to other processes. Since all these processes have the same characteristics, they will go through many such life cycles. Therefore system ends up spending majority of time in frequent Kernel thread context switches instead of applications work. This phenomenon of "thrashing" causes high system CPU utilization because of single file competing.

The problem occurs only when system baseload (the load when launching batch jobs) achieves certain high.

As tools, vmstat and sar are used for CPU, filemon for file IO. In AIX, nmon report could provide much rich information. By using "ps -efl" (or Berkeley Standards "ps glew"), one can see many <defunct> and <exiting> processes (more than 50 each) during problem time. This is because batchload forks loadconf into background and it quits before children processes terminate, and children processes becomes "orphan".

When monitoring the CPU performance, it is better to check CPU utilization instead of RunQueue since CPU utilization is an average over an interval, whereas RunQueue is a sampling (for example in vmstat), normally by a frequency of 100 microseconds, however, today’s CPU ticks in nanosecond. Thus RunQueue can not be very precise. For instance, submitting 100 jobs does not mean RunQueue showing 100. In fact, Nyquist sampling theorem requires the sampling should be double fast.

On Solaris, paper (http://www.solarisinternals.com/wiki/images/2/28/Technocrat-util.pdf)
has a profound discussion on run queues and demonstrates how to
tracks the insertion of runnable threads onto CPU run queues with a DTrace script.

One thing I noticed is that "System calls" in vmstat does not necessarily coincide
with Runqueues and context switches. That is still not clear to me.

This Blog also shows that finding the root cause is paramount for Oracle Performance troubleshooting.

Fix:

A quick fix is to insert a sleep after sending each job so that CPU power is mitigated to reduce the IO burden.

Test Codes:

Script-1 creates a config file, which looks like JAVA's properties file.
Script-2 reads the config file.
Script-3 launches N batch jobs in background.

Test can be performed by calling on one UNIX session:
batchload 100
and at same time running vmstat on a second session.

Script-1 confcreate

#!/bin/ksh

# usage: confcreate

count=0

max=1000

configfile="conf"

while [[ $count -lt $max ]]

print "name$count=val$count" >> $configfile

count=$((count + 1))

done

Script-2 loadconf

#!/bin/ksh

# usage: loadconf

count=0

max=1000

configfile="conf"

while [[ $count -lt $max ]]

val=$(awk -F"=" "/^(name$count)=/"'{print $2}' conf)

count=$((count + 1))

# some Sqlplus script to do the business

done

Script-3 batchload

#!/bin/ksh

# usage: . ./batchload 100

count=0

max=$1

while [[ $count -lt $max ]]

./loadconf &

print batchload $count at $(date)

# sleep 2 # sleep to relent CPU

count=$((count + 1))

done

Tuesday, April 24, 2012

High system time utilization on fast machine

Problem:

Analysis:

Fix:

Test Codes: