Sun Java Solaris Communities My SDN Account Join SDN

Article

Analyzing Process Behavior in the Solaris OS

 
By Adrian Cockcroft, June 1999  

How to locate and unclog system bottlenecks

Q: My system is running slow, but I can't find anything wrong. How do I locate and fix the problem?

A: Your first step should be to look at what is running on the system in more detail. Once you identify which processes are active, you can narrow it down to the one(s) causing the delay and determine what kind of resource is bottlenecking the process.

Process information sources

Process tools have been pretty well represented in the pages of SunWorld. I've explained the common tools and underlying operating system interfaces in "What`s the best way to probe processes?" and described an SE toolkit extension for process and workload monitoring in "Prying into processes and workloads." Peter Galvin, SunWorld's resident systems expert, examined process tools in detail in his April 1999 column, "The power of /proc." And finally, I covered the latest version of Proctool in last`s month`s column.

The nice thing about the SE toolkit process and workload monitor is that it provides the Solaris 2 microstate accounting and resource usage data, which you can use to identify bottlenecks. The following data is provided by the SE toolkit process_class.se. The SE toolkit is based on a C language interpreter, so you can obtain this information as you would a normal C data structure.


	/* output data for specified process */
	long      flags; /* latest pr_flag for the process */
	double    interval;  /* measured time interval for averages */
	double    timestamp; /* last time process was measured */
	double    creation;  /* process start time */
	double    termination; /* process termination time stamp */
	double    elapsed; /* elapsed time for all lwps in process */
	double    total_user; /* current totals in seconds */
	double    total_system;
        double    total_sampled_cpu;
	double    total_child;
	double    user_time; /* user time in this interval */
	double    system_time; /* system call time in this interval */
	double    sampled_cpu; /* total sampled cpu time */
	double    trap_time; /* system trap time in interval */
	double    child_time; /* child CPU in this interval */
	double    text_pf_time; /* text page fault wait in interval */
	double    data_pf_time; /* data page fault wait in interval */
	double    kernel_pf_time; /* kernel page fault wait in interval */
	double    user_lock_time; /* user lock wait in interval */
	double    sleep_time; /* all other sleep time */
	double    cpu_wait_time; /* time on runqueue waiting for CPU */
	double    stoptime; /* time stopped from ^Z */
	ulong	  syscalls; /* syscall/interval for this process */
	ulong     inblocks; /* input blocks/interval */
	ulong	  outblocks; /* output blocks/interval */
	ulong	  vmem_size; /* size in KB */
	long	  vmem_delta; /* size change in KB */
	ulong	  rmem_size; /* RSS in KB */
#ifdef XMAP
	ulong     pmem_size; /* private mem in KB */
	ulong     smem_size; /* shared mem in KB */
#endif
	ulong     maj_faults; /* majf/interval */
	ulong  	  min_faults; /* minf/interval */
	ulong  	  total_swaps; /* swapout count */
	long      priority; /* current sched priority */
	long      niceness; /* current nice value */
	char      sched_class[PRCLSZ]; /* name of class */
	ulong     messages; /* msgin+msgout/interval */
	ulong     signals; /* signals/interval */
	ulong     vcontexts; /* voluntary context switches/interval */
	ulong     icontexts; /* in-ditto */
	ulong     charios; /* characters in and out/interval */
	ulong     lwp_count; /* number of lwps for the process */
	int       uid; /* current uid */
	long      ppid; /* parent pid */
 	int       bindpset; /* processor set binding */
	char      fname[PRFNSZ]; /* last components of exec'd pathname */
	char     args[PRARGSZ]; /* initial part of arg list */

The SE toolkit also includes a workload summarization class that adds all the processes matching a given criteria. Matching can be performed by user name, process command name, command arguments, and, in the very latest version, by the processor set binding of the process. I described processor sets back in May of last year. The workload class reports exactly the same set of data shown above, but the totals are the sum of the matching processes.

Tools for interpreting process -- Room for improvement

While these tools provide a lot of numbers to look at, none of them interpret the numbers to tell you where problems exist, or what you can do about them. In fact, it's quite rare for any performance tools to include generic process monitoring rules. Application monitoring is provided by products like BMC Patrol and Foglight RAPS. (Foglight, formerly Resolute Software, is where SE toolkit author Rich Pettit works.) These two products monitor specific named application processes and log files.

What I'm interested in is something more generic like the analysis BMC Best/1 provides from its capacity planning model, which determines how much of a workload's response time is made up of consuming or waiting for resources like CPU, disk, or network devices. To analyze date using Best/1, you collect data first and do the analysis offline. I, however, want to analyze the data in realtime using the microstate data as additional input.

Back to Top


Process rules

The data provided by process_class.se is measured in intervals, so we can report on what we think that process is doing during a particular interval by interpreting the data.

Child CPU Activity rule
When a process forks a child, and that child subsequently exits, the CPU time consumed by the child process is accumulated into the child CPU total for the parent. So if this metric is nonzero, then a child process has exited, and we can infer that this process is forking and reaping child processes.

Process Stopped Time rule
There is a microstate counter for the amount of time that a process is in the stopped state, which occurs when a stop signal is sent to the process, or when you manually enter Ctrl-Z at the shell. The stopped state also occurs during debug and when truss is running (when another process takes control of this process to analyze it, the process is suspended for some of the time).

Back to Top


Memory Shortage rules
It is important to know if a process is suffering from a shortage of memory. SE provides three ways to determine memory use: a microstate counter for the time spent waiting for page faults in executable code, and counting the total number of page faults and swap-outs.

I/O Bottleneck rules
The time spent waiting for page faults for non-executable data pages is counted separately and indicates I/O activity. The number of characters read and written in read/write system calls is also available, as is the number of blocks input and output. However, the block count data is not updated except for filesystem metadata transfers; a bugfix to make the block data more useful by including all data transfers in currently in the works.

CPU Bottleneck rules
The microstate measures of user and system CPU time don't miss anything, so if they are zero, the process has definitely not run in the interval, and an accurate list of all the active processes in the interval can be constructed. The normal CPU time counters sample every 10 microseconds and catch only processes that are running when the clock interrupt occurs, as described in my column on CPU time measurement errors. There is also a microstate measure of the amount of time that the process was ready to run, but was waiting in the CPU run queue for a CPU to become available. If this value reaches a high enough level, you definitely have a CPU bottleneck.

Back to Top


SE also has a measure of the number of voluntary and involuntary context switches. An involuntary context switch occurs when either a higher priority process has taken over the CPU or when the process has used up an entire time slice and had its priority reduced as a consequence. Involuntary context switches indicate that there is some contention for the CPU, and a bottleneck exists. Voluntary context switches, by contrast are a normal part of the operation of the process and occur when the process blocks in a system call to wait for something else to complete.

Process Size Growing
Most long-lived processes grow their address space quickly when they first start up, but then they stabilize. If a process' virtual memory size continues to grow, some kind of memory leak is probable, so watch them carefully.

Multithreaded Process Activity
Multithreaded processes can consume more than 100 percent of CPU time and can also report more than 100 percent of wait time. Every thread that is blocked on a semaphore will report via the microstate user lock wait time. This can be confusing, as a process can accumulate an enormous amount of wait time in a report. Wait time is another word for idle time, so this simply means that there are a bunch of idle threads, and no resources are being consumed apart from the memory needed to hold the thread data structures and stack space. Because the Java runtime environment is multithreaded, it is quite common for this to occur with Java programs.

Back to Top


Process rule implementation

I have implemented these rules in SE. The input is a process or workload class data structure containing all the information shown above. Each rule has a threshold and an explanation string that indicates what it is for. Like the other rules I've written in SE, I use environment variables to set up modified thresholds, and each threshold is defined in an initialized C data structure that also indicates how to format the data for display. I also report a status code, a short action string, and a longer explanation string as outputs from each rule.

The status codes are defined in SE as:


#define ST_WHITE  0  /* Nothing happening - virginal */
#define ST_BLUE   1  /* Unhappy - blue - neglected when other things are 
                     busy -- imbalance of resource use */
#define ST_GREEN  2  /* No worries, mate (Australian accent :-) */
#define ST_AMBER  3  /* Achtung (German accent :-) */
#define ST_RED    4  /* Au secours (French accent :-) */
#define ST_BLACK  5  /* Mayday (M'aidez) -- panic level */

The structure definition for double precision thresholds is shown below; a similar structure is used to define integer and string threshold types.


struct rule_thresh_dbl { 
  string env; /* environment variable definition */
  double def; /* default value */
  string units; /* units for printing */
  int    width; /* field width for printing */
  int    prec;  /* precision, as in printf %width.prec */
  string comment; /* meaningful label or explanation */
  };

The threshold definitions all use this structure type. In most cases, I have set the thresholds to zero, so the rule warns on any activity at all. I chose this approach for two reasons: to allow me to test that the code works by having every rule fire and also to determine what threshold value provides a good balance between missing a problem completely and setting off many false alarms. I think it is best to make the rule over-sensitive at the beginning, tuning out warnings later on.

Back to Top


Child CPU Activity rule
The threshold is zero, and its units are percentages, which is time relative to the measurement interval. For example, if you set the threshold at 10 percent and the interval was 60 seconds, the rule would fire if the total child CPU time increased by 6 seconds. When a long-running, CPU-intensive process exits, the shell that spawned it will accumulate a huge amount of child CPU time in one go, so large numbers are not a sign of a problem. This rule is very useful to identify processes that spawn a lot of small, short-lived sub-commands, which are hard to catch with a regular process monitor.

The threshold definition and rule code are shown below. I've omitted the code that surrounds the core rule itself. If the rule fires (this applies to all the rules), one word is appended to the action string, one line is appended to the explanation string, and the state code is set to be at least as bad a state (that is, if the state was already red, it would not be reset down to amber).

To increase the threshold, use setenv PROC_CHILD_WARN 10.0, which will be picked up automatically by any script that uses this rule.


/* child_time - set thresh to zero to start with */ 
rule_thresh_dbl proc_child_warn = {"PROC_CHILD_WARN", 0.0, "%", 4,
	1, "process is forking and reaping child processes"};

		/* check for forking child_time */
		if (100.0 * p.child_time/p.interval > child_warn) {
      			if (state < ST_AMBER) {
				state = ST_AMBER;
			}
        		tmpaction = strcat(tmpaction, " forking");
        		explanation = sprintf("%s: %s\n", 
                		state_string(ST_AMBER),
				proc_child_warn.comment);
        		strncat(tmpexp, explanation, 512);
		}

Back to Top


Process Stopped Time rule
This rule also has a zero threshold and very similar code. There seems to be a bug in the way microstate data is accumulated for some kernel threads, and the fsflush and pageout processes accumulate all their time in the stopped state. The best way to workaround this problem is to ignore process id 2 and 3.



/* stopped_time */ 
rule_thresh_dbl proc_stop_warn = {"PROC_STOP_WARN", 0.0, "%", 4,
        1, "process is ^Z stopped or under truss/debug control"};

		/* stoptime - ignore pageout 2 and fsflush 3 */
                if (p.pid$ != 2 && p.pid$ != 3 &&
			            100.0 * p.stoptime/p.interval >   stop_warn) {
                        if (state < ST_AMBER) {
                                state = ST_AMBER;
                        }
                        tmpaction = strcat(tmpaction, " stopped");
                        explanation = sprintf("%s: %s\n",
                                state_string(ST_AMBER),
                                proc_stop_warn.comment);
                        strncat(tmpexp, explanation, 512);
                }

Back to Top


Memory Shortage rules
This breaks down into two separate rules: one combines major faults and the text page fault wait state, and the other just reports that the process has just been swapped. Swapping happens to idle processes, so when memory is short you would expect to see one process reporting a memory shortage and different processes reporting that they have been swapped out to make space.


/* memory shortage - text_pf_time maj_faults swaps */
rule_thresh_dbl proc_textpf_warn = {"PROC_TEXTPF_WARN", 0.0, "%", 4,
        1, "process is delayed by page faults in executable code"};
rule_thresh_dbl proc_majf_warn = {"PROC_MAJPF_WARN", 0.0, "pf/s", 4,         1, "process is taking major page faults"};
rule_thresh_dbl proc_swap_warn = {"PROC_SWAP_WARN", 0.0, "swaps", 4,
        1, "process swapped out"};

The code is very similar to the previous rules, apart from the combined test for a memory shortage, so I haven't included the code for the swap rule.


/* text_pf_time maj_faults */
        if (100.0 * p.text_pf_time/p.interval > textpf_warn &&
    p.maj_faults/p.interval > majf_warn) {
               if (state < ST_AMBER) {
                       state = ST_AMBER;
               }
               tmpaction = strcat(tmpaction, " memory");
               explanation = sprintf("%s: %s\n",
                      state_string(ST_AMBER),
                      proc_textpf_warn.comment);
               strncat(tmpexp, explanation, 512);
        }


Back to Top


I/O Bottleneck rule
This rule introduces two new thresholds, but uses the page fault threshold from the memory rule, as well. The character I/O threshold is set at 100 KB/s by default. "Anding" together all three of these values may be too restrictive. I'm not really sure how well this rule works yet, but I am sure that these are the right metrics to watch.


/* I/O bound - data_pf_time maj_faults charios */
rule_thresh_dbl proc_datapf_warn = {"PROC_DATAPF_WARN", 0.0, "%", 4,
        1, "process is delayed by page faults in data file access"}; rule_thresh_dbl proc_charios_warn = {"PROC_CHARIOS_WARN", 100.0, 
	"KB/s", 5, 1, "process is doing significant I/O"}; 

            /* data_pf_time maj_faults charios*/
                    if (100.0 * p.data_pf_time/p.interval > datapf_warn &&
                           p.maj_faults/p.interval > majf_warn &&
                p.charios/p.interval > charios_warn) {
                           if (state < ST_AMBER) {   
                                   state = ST_AMBER;
                           }
                           tmpaction = strcat(tmpaction, " IObound");
                           explanation = sprintf("%s: %s\n",   
                                  state_string(ST_AMBER), 
                                  proc_datapf_warn.comment);
                           strncat(tmpexp, explanation, 512);
                   }


Back to Top


CPU Bottleneck rules
I decided not to bother writing a rule that examines CPU usage and tells you that the process is using the CPU, as that seems too obvious and already well understood. A tool may choose to display only processes that have used some CPU time in the interval, or have a warning state set (idle and swapped out, for example), but it is not considered a problem to use some CPU time!


/* CPU bound - user_time system_time cpu_wait_time */ 
/* vcontexts icontexts */
rule_thresh_dbl proc_cpu_wait_warn = {"PROC_CPU_WAIT_WARN", 5.0,
     "%", 4, 1, "process is waiting for a CPU to become free"}; 
rule_thresh_dbl proc_icontexts_warn = {"PROC_ICONTEXTS_WARN", 10.0,
     "ctx/s", 4, 1, "process is being pre-empted by other jobs"};

The rule code treats these as two separate conditions, and it is similar to previous examples so I won't bore you by showing it again. I use a five percent threshold for CPU wait time; if it's much more than five percent, the process will be slowed enough to notice. A low level of involuntary context switches always occur in normal operation, so I set a threshold of 10 per second to start with. This value probably needs to be increased.

Back to Top


Process Size Growing
This rule combines two thresholds: the process must be at least 10 MB in size, and it must be growing at a rate of 8 KB/s. Again, these are fairly low limits; to tune the rule, you need to know how big the processes on your system are expected to get. Remember, this rule looks at the virtual memory size, not the amount of RAM in use by the process, which can go up and down.


/* Process growing maybe leaking - vmem_size */
rule_thresh_dbl proc_vmem_warn = {"PROC_VMEM_WARN", 10.0, "MB", 5,
        1, "process is large"}; 
rule_thresh_dbl proc_vmem_rate_warn = {"PROC_VMEM_RATE_WARN", 8.0,
    "KB/s", 4, 1, "process is growing"}; 

Multithreaded Process Activity
This rule simply identifies which processes have extra idle threads using a zero threshold. It could also look at the number of lightweight processes (lwps) in the process, but for now I've left it simple.


/* stalled threads - lwp_count user_lock_time */
rule_thresh_dbl proc_user_lock_warn = {"PROC_USER_LOCK_WARN", 0.0,
     "%", 4, 1, "process threads waiting on user lock"};

Back to Top


Test Programs

I've built a couple of test programs: peer.se looks at every process on the system and prints out the status and action string; pry.se looks at a single process and prints out the multiline explanation string, as well. Both are based on the code for pea.se, my basic process monitor. They could easily be extended to use combined workload data as input, although some of the thresholds may have to be adjusted, depending upon how many processes are being accumulated into the workload.

Below is a sample of the output from peer.se. I ran a copy of find in another window, and in this interval find had just exited. The process class reports that the process has gone away as deadproc 1 at the bottom. Because find has exited, its CPU time has been accumulated into the parent shell, which is shown with the correct rule status as forking, with 58.79 percent child CPU time. The peer.se script is running on the interpreter se.sparc.5.6, and as you can see, it uses about 15 percent of an 85-MHz microSPARC to collect the data and run the rule on 84 processes every 10 seconds. This is fast enough for me, although there are several places where the process_class in particular could be made more efficient. The process rule is not adding much CPU time, as pea.se also uses about 15 percent of the CPU -- the difference is less than one percent.


10:53:54 name lwmx   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf
csh              1   344   341  9506   0.01   0.01  0.07 58.79   1068   636  0.0 
  amber: forking
se.sparc.5.6     1  2766  2618  9506  14.98   0.87  8.82  0.00   2272  2000  0.0 
  amber: Invctx
find             1  2776   344  9506  11.06  12.98  0.00  0.00    844   668  0.0 
  amber: stopped Invctx Growing
nproc 84  newproc 0  deadproc 1

The second program looks at a single process, so I started a copy of find, noted its pid, then fired up pry.se on that pid. Later, I stopped it and then killed it. This program uses the wide-screen output format with a lot of extra data.


% se pry.se 2886
...
monpid 2886  nproc 85  newproc 83  deadproc 0
11:33:30 name lwmx   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf  inblk outblk chario   sysc   vctx   ictx   msps
find             1  2886   344  9506  19.30  20.69  0.00  0.00    840   668  6.0  16.36   0.07  37430      0  27.05  21.98   8.15 
  amber: Invctx Growing

amber: process is being pre-empted by other jobs
amber: process is growing

monpid 2886  nproc 85  newproc 0  deadproc 0
11:33:40 name lwmx   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf  inblk outblk chario   sysc   vctx   ictx   msps
find             1  2886   344  9506   3.82   5.01  1.15  0.00    840   668  8.4   4.73   3.19   7124    191  13.33   5.22   4.75 
  amber: Invctx

amber: process is being pre-empted by other jobs

monpid 2886  nproc 85  newproc 0  deadproc 0
11:33:50 name lwmx   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf  inblk outblk chario   sysc   vctx   ictx   msps
find             1  2886   344  9506   0.00   0.00  0.00  0.00    840   668  0.0   0.00   0.00      0      0   0.00   0.00   0.00 
  amber: stopped

amber: process is ^Z stopped or under truss/debug control

monpid 2886  nproc 84  newproc 0  deadproc 1
Process has terminated
Wrap up

I've folded this new code into the next build of SE 3.1 prefcs, although this rule is still at the experimental stage. Please let me know if you find it useful or can think of ways to improve it. I'd like to see rules like this in all the commercial tools, so that the cause of problems can be identified more easily.


Resources

Recommended reading

Additional resources

Related articles and FAQs in SunWorld

Other SunWorld resources


Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Computer Systems Division of Sun. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning: Java and the Internet, both published by Sun Microsystems Press Books.

Back to Top


Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Comments:
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.