Problem:
Recently we hit some performance problem on a system running a heavy Oracle application. Occasionally when more than 100 "end of day" batch jobs are automatically submitted by a Job control system in the early morning, no more new users can login into Oracle (alert log shows many lines of ORA-609: that is, could not attach to incoming connection).
By only concentrating on Oracle performance, it could not figure out any particular clue.
The already logged in users and applications can still work, Oracle workload is not high.
Then we asked UNIX administrator to send us some snapshots of vmstat(AIX NMON report even better), immediately it turns out that Runqueues, Kernel thread context switches, system cpu utilization are extremely high in comparing to normal operation.
Here are two entries of vmstat, one is in normal time, another is during batch (problem) time.
----- ------------ --------- --------
r sy cs us sy id hr mi se
37 250806 7412 34 2 64 09:53:39 << normal time
185 87914 34261 12 80 8 02:52:44 << batch time
Analysis:
The application is running on high end IBM AIX server with many cores(Physical CPUs).
When 100 batch jobs are submitted, most of them can be scheduled to "Running" state,
however, there is a single config file to be read, hence inherently creating a single point of contention (bottleneck). IO is normal not fast as CPU. The faster the CPU, the higher contention.
Here is what AIX default Scheduler (SCHED_OTHER) would do. When a "New" process is inserted into system, it is first put into "Ready". Then scheduler brings it to "Running" state, in this case, it has to wait IO, thus switches to "Preempted" state, and later back again to "Ready" state, eventually waits scheduler to bring into "Running" state again. Thus it yields CPU to other processes. Since all these processes have the same characteristics, they will go through many such life cycles. Therefore system ends up spending majority of time in frequent Kernel thread context switches instead of applications work. This phenomenon of "thrashing" causes high system CPU utilization because of single file competing.
As tools, vmstat and sar are used for CPU, filemon for file IO. In AIX, nmon report could provide much rich information. By using "ps -efl" (or Berkeley Standards "ps glew"), one can see many <defunct> and <exiting> processes (more than 50 each) during problem time. This is because batchload forks loadconf into background and it quits before children processes terminate, and children processes becomes "orphan".
When monitoring the CPU performance, it is better to check CPU utilization instead of RunQueue since CPU utilization is an average over an interval, whereas RunQueue is a sampling (for example in vmstat), normally by a frequency of 100 microseconds, however, today’s CPU ticks in nanosecond. Thus RunQueue can not be very precise. For instance, submitting 100 jobs does not mean RunQueue showing 100. In fact, Nyquist sampling theorem requires the sampling should be double fast.
has a profound discussion on run queues and demonstrates how to
tracks the insertion of runnable threads onto CPU run queues with a DTrace script.
with Runqueues and context switches. That is still not clear to me.
Fix:
A quick fix is to insert a sleep after sending each job so that CPU power is mitigated to reduce the IO burden.
Test Codes:
- Script-1 creates a config file, which looks like JAVA's properties file.
- Script-2 reads the config file.
- Script-3 launches N batch jobs in background.
batchload 100
and at same time running vmstat on a second session.
Script-1 confcreate
# usage: confcreate
count=0
max=1000
configfile="conf"
while [[ $count -lt $max ]]
do
print "name$count=val$count" >> $configfile
count=$((count + 1))
done
#!/bin/ksh
# usage: loadconf
count=0
max=1000
configfile="conf"
while [[ $count -lt $max ]]
do
val=$(awk -F"=" "/^(name$count)=/"'{print $2}' conf)
count=$((count + 1))
# some Sqlplus script to do the business
done
#!/bin/ksh
# usage: . ./batchload 100
count=0
max=$1
while [[ $count -lt $max ]]
do
./loadconf &
print batchload $count at $(date)
# sleep 2 # sleep to relent CPU
count=$((count + 1))
done