How to locate and unclog system bottlenecksQ: My system is running slow, but I can't find anything wrong. How do I locate and fix the problem? A: Your first step should be to look at what is running on the system in more detail. Once you identify which processes are active, you can narrow it down to the one(s) causing the delay and determine what kind of resource is bottlenecking the process. Process information sourcesProcess tools have been pretty well represented in the pages of SunWorld. I've explained the common tools and underlying operating system interfaces in "What`s the best way to probe processes?" and described an SE toolkit extension for process and workload monitoring in "Prying into processes and workloads." Peter Galvin, SunWorld's resident systems expert, examined process tools in detail in his April 1999 column, "The power of /proc." And finally, I covered the latest version of Proctool in last`s month`s column. The nice thing about the SE toolkit process and workload monitor is that it provides the Solaris 2 microstate accounting and resource usage data, which you can use to identify bottlenecks. The following data is provided by the SE toolkit process_class.se. The SE toolkit is based on a C language interpreter, so you can obtain this information as you would a normal C data structure.
/* output data for specified process */
long flags; /* latest pr_flag for the process */
double interval; /* measured time interval for averages */
double timestamp; /* last time process was measured */
double creation; /* process start time */
double termination; /* process termination time stamp */
double elapsed; /* elapsed time for all lwps in process */
double total_user; /* current totals in seconds */
double total_system;
double total_sampled_cpu;
double total_child;
double user_time; /* user time in this interval */
double system_time; /* system call time in this interval */
double sampled_cpu; /* total sampled cpu time */
double trap_time; /* system trap time in interval */
double child_time; /* child CPU in this interval */
double text_pf_time; /* text page fault wait in interval */
double data_pf_time; /* data page fault wait in interval */
double kernel_pf_time; /* kernel page fault wait in interval */
double user_lock_time; /* user lock wait in interval */
double sleep_time; /* all other sleep time */
double cpu_wait_time; /* time on runqueue waiting for CPU */
double stoptime; /* time stopped from ^Z */
ulong syscalls; /* syscall/interval for this process */
ulong inblocks; /* input blocks/interval */
ulong outblocks; /* output blocks/interval */
ulong vmem_size; /* size in KB */
long vmem_delta; /* size change in KB */
ulong rmem_size; /* RSS in KB */
#ifdef XMAP
ulong pmem_size; /* private mem in KB */
ulong smem_size; /* shared mem in KB */
#endif
ulong maj_faults; /* majf/interval */
ulong min_faults; /* minf/interval */
ulong total_swaps; /* swapout count */
long priority; /* current sched priority */
long niceness; /* current nice value */
char sched_class[PRCLSZ]; /* name of class */
ulong messages; /* msgin+msgout/interval */
ulong signals; /* signals/interval */
ulong vcontexts; /* voluntary context switches/interval */
ulong icontexts; /* in-ditto */
ulong charios; /* characters in and out/interval */
ulong lwp_count; /* number of lwps for the process */
int uid; /* current uid */
long ppid; /* parent pid */
int bindpset; /* processor set binding */
char fname[PRFNSZ]; /* last components of exec'd pathname */
char args[PRARGSZ]; /* initial part of arg list */
The SE toolkit also includes a workload summarization class that adds all the processes matching a given criteria. Matching can be performed by user name, process command name, command arguments, and, in the very latest version, by the processor set binding of the process. I described processor sets back in May of last year. The workload class reports exactly the same set of data shown above, but the totals are the sum of the matching processes. Tools for interpreting process -- Room for improvementWhile these tools provide a lot of numbers to look at, none of them interpret the numbers to tell you where problems exist, or what you can do about them. In fact, it's quite rare for any performance tools to include generic process monitoring rules. Application monitoring is provided by products like BMC Patrol and Foglight RAPS. (Foglight, formerly Resolute Software, is where SE toolkit author Rich Pettit works.) These two products monitor specific named application processes and log files. What I'm interested in is something more generic like the analysis BMC Best/1 provides from its capacity planning model, which determines how much of a workload's response time is made up of consuming or waiting for resources like CPU, disk, or network devices. To analyze date using Best/1, you collect data first and do the analysis offline. I, however, want to analyze the data in realtime using the microstate data as additional input. Process rulesThe data provided by process_class.se is measured in intervals, so we can report on what we think that process is doing during a particular interval by interpreting the data. Child CPU Activity rule Process Stopped Time rule Memory Shortage rules I/O Bottleneck rules CPU Bottleneck rules SE also has a measure of the number of voluntary and involuntary context switches. An involuntary context switch occurs when either a higher priority process has taken over the CPU or when the process has used up an entire time slice and had its priority reduced as a consequence. Involuntary context switches indicate that there is some contention for the CPU, and a bottleneck exists. Voluntary context switches, by contrast are a normal part of the operation of the process and occur when the process blocks in a system call to wait for something else to complete. Process Size Growing Multithreaded Process Activity Process rule implementationI have implemented these rules in SE. The input is a process or workload class data structure containing all the information shown above. Each rule has a threshold and an explanation string that indicates what it is for. Like the other rules I've written in SE, I use environment variables to set up modified thresholds, and each threshold is defined in an initialized C data structure that also indicates how to format the data for display. I also report a status code, a short action string, and a longer explanation string as outputs from each rule. The status codes are defined in SE as:
#define ST_WHITE 0 /* Nothing happening - virginal */
#define ST_BLUE 1 /* Unhappy - blue - neglected when other things are
busy -- imbalance of resource use */
#define ST_GREEN 2 /* No worries, mate (Australian accent :-) */
#define ST_AMBER 3 /* Achtung (German accent :-) */
#define ST_RED 4 /* Au secours (French accent :-) */
#define ST_BLACK 5 /* Mayday (M'aidez) -- panic level */
The structure definition for double precision thresholds is shown below; a similar structure is used to define integer and string threshold types.
struct rule_thresh_dbl {
string env; /* environment variable definition */
double def; /* default value */
string units; /* units for printing */
int width; /* field width for printing */
int prec; /* precision, as in printf %width.prec */
string comment; /* meaningful label or explanation */
};
The threshold definitions all use this structure type. In most cases, I have set the thresholds to zero, so the rule warns on any activity at all. I chose this approach for two reasons: to allow me to test that the code works by having every rule fire and also to determine what threshold value provides a good balance between missing a problem completely and setting off many false alarms. I think it is best to make the rule over-sensitive at the beginning, tuning out warnings later on. Child CPU Activity rule The threshold definition and rule code are shown below. I've omitted the code that surrounds the core rule itself. If the rule fires (this applies to all the rules), one word is appended to the action string, one line is appended to the explanation string, and the state code is set to be at least as bad a state (that is, if the state was already red, it would not be reset down to amber). To increase the threshold, use setenv PROC_CHILD_WARN 10.0, which will be picked up automatically by any script that uses this rule.
/* child_time - set thresh to zero to start with */
rule_thresh_dbl proc_child_warn = {"PROC_CHILD_WARN", 0.0, "%", 4,
1, "process is forking and reaping child processes"};
/* check for forking child_time */
if (100.0 * p.child_time/p.interval > child_warn) {
if (state < ST_AMBER) {
state = ST_AMBER;
}
tmpaction = strcat(tmpaction, " forking");
explanation = sprintf("%s: %s\n",
state_string(ST_AMBER),
proc_child_warn.comment);
strncat(tmpexp, explanation, 512);
}
Process Stopped Time rule
/* stopped_time */
rule_thresh_dbl proc_stop_warn = {"PROC_STOP_WARN", 0.0, "%", 4,
1, "process is ^Z stopped or under truss/debug control"};
/* stoptime - ignore pageout 2 and fsflush 3 */
if (p.pid$ != 2 && p.pid$ != 3 &&
100.0 * p.stoptime/p.interval > stop_warn) {
if (state < ST_AMBER) {
state = ST_AMBER;
}
tmpaction = strcat(tmpaction, " stopped");
explanation = sprintf("%s: %s\n",
state_string(ST_AMBER),
proc_stop_warn.comment);
strncat(tmpexp, explanation, 512);
}
Memory Shortage rules
/* memory shortage - text_pf_time maj_faults swaps */
rule_thresh_dbl proc_textpf_warn = {"PROC_TEXTPF_WARN", 0.0, "%", 4,
1, "process is delayed by page faults in executable code"};
rule_thresh_dbl proc_majf_warn = {"PROC_MAJPF_WARN", 0.0, "pf/s", 4, 1, "process is taking major page faults"};
rule_thresh_dbl proc_swap_warn = {"PROC_SWAP_WARN", 0.0, "swaps", 4,
1, "process swapped out"};
The code is very similar to the previous rules, apart from the combined test for a memory shortage, so I haven't included the code for the swap rule.
/* text_pf_time maj_faults */
if (100.0 * p.text_pf_time/p.interval > textpf_warn &&
p.maj_faults/p.interval > majf_warn) {
if (state < ST_AMBER) {
state = ST_AMBER;
}
tmpaction = strcat(tmpaction, " memory");
explanation = sprintf("%s: %s\n",
state_string(ST_AMBER),
proc_textpf_warn.comment);
strncat(tmpexp, explanation, 512);
}
I/O Bottleneck rule
/* I/O bound - data_pf_time maj_faults charios */
rule_thresh_dbl proc_datapf_warn = {"PROC_DATAPF_WARN", 0.0, "%", 4,
1, "process is delayed by page faults in data file access"}; rule_thresh_dbl proc_charios_warn = {"PROC_CHARIOS_WARN", 100.0,
"KB/s", 5, 1, "process is doing significant I/O"};
/* data_pf_time maj_faults charios*/
if (100.0 * p.data_pf_time/p.interval > datapf_warn &&
p.maj_faults/p.interval > majf_warn &&
p.charios/p.interval > charios_warn) {
if (state < ST_AMBER) {
state = ST_AMBER;
}
tmpaction = strcat(tmpaction, " IObound");
explanation = sprintf("%s: %s\n",
state_string(ST_AMBER),
proc_datapf_warn.comment);
strncat(tmpexp, explanation, 512);
}
CPU Bottleneck rules
/* CPU bound - user_time system_time cpu_wait_time */
/* vcontexts icontexts */
rule_thresh_dbl proc_cpu_wait_warn = {"PROC_CPU_WAIT_WARN", 5.0,
"%", 4, 1, "process is waiting for a CPU to become free"};
rule_thresh_dbl proc_icontexts_warn = {"PROC_ICONTEXTS_WARN", 10.0,
"ctx/s", 4, 1, "process is being pre-empted by other jobs"};
The rule code treats these as two separate conditions, and it is similar to previous examples so I won't bore you by showing it again. I use a five percent threshold for CPU wait time; if it's much more than five percent, the process will be slowed enough to notice. A low level of involuntary context switches always occur in normal operation, so I set a threshold of 10 per second to start with. This value probably needs to be increased. Process Size Growing
/* Process growing maybe leaking - vmem_size */
rule_thresh_dbl proc_vmem_warn = {"PROC_VMEM_WARN", 10.0, "MB", 5,
1, "process is large"};
rule_thresh_dbl proc_vmem_rate_warn = {"PROC_VMEM_RATE_WARN", 8.0,
"KB/s", 4, 1, "process is growing"};
Multithreaded Process Activity
/* stalled threads - lwp_count user_lock_time */
rule_thresh_dbl proc_user_lock_warn = {"PROC_USER_LOCK_WARN", 0.0,
"%", 4, 1, "process threads waiting on user lock"};
Test ProgramsI've built a couple of test programs: peer.se looks at every process on the system and prints out the status and action string; pry.se looks at a single process and prints out the multiline explanation string, as well. Both are based on the code for pea.se, my basic process monitor. They could easily be extended to use combined workload data as input, although some of the thresholds may have to be adjusted, depending upon how many processes are being accumulated into the workload. Below is a sample of the output from peer.se. I ran a copy of find in another window, and in this interval find had just exited. The process class reports that the process has gone away as deadproc 1 at the bottom. Because find has exited, its CPU time has been accumulated into the parent shell, which is shown with the correct rule status as forking, with 58.79 percent child CPU time. The peer.se script is running on the interpreter se.sparc.5.6, and as you can see, it uses about 15 percent of an 85-MHz microSPARC to collect the data and run the rule on 84 processes every 10 seconds. This is fast enough for me, although there are several places where the process_class in particular could be made more efficient. The process rule is not adding much CPU time, as pea.se also uses about 15 percent of the CPU -- the difference is less than one percent. 10:53:54 name lwmx pid ppid uid usr% sys% wait% chld% size rss pf csh 1 344 341 9506 0.01 0.01 0.07 58.79 1068 636 0.0 amber: forking se.sparc.5.6 1 2766 2618 9506 14.98 0.87 8.82 0.00 2272 2000 0.0 amber: Invctx find 1 2776 344 9506 11.06 12.98 0.00 0.00 844 668 0.0 amber: stopped Invctx Growing nproc 84 newproc 0 deadproc 1 The second program looks at a single process, so I started a copy of find, noted its pid, then fired up pry.se on that pid. Later, I stopped it and then killed it. This program uses the wide-screen output format with a lot of extra data. % se pry.se 2886 ... monpid 2886 nproc 85 newproc 83 deadproc 0 11:33:30 name lwmx pid ppid uid usr% sys% wait% chld% size rss pf inblk outblk chario sysc vctx ictx msps find 1 2886 344 9506 19.30 20.69 0.00 0.00 840 668 6.0 16.36 0.07 37430 0 27.05 21.98 8.15 amber: Invctx Growing amber: process is being pre-empted by other jobs amber: process is growing monpid 2886 nproc 85 newproc 0 deadproc 0 11:33:40 name lwmx pid ppid uid usr% sys% wait% chld% size rss pf inblk outblk chario sysc vctx ictx msps find 1 2886 344 9506 3.82 5.01 1.15 0.00 840 668 8.4 4.73 3.19 7124 191 13.33 5.22 4.75 amber: Invctx amber: process is being pre-empted by other jobs monpid 2886 nproc 85 newproc 0 deadproc 0 11:33:50 name lwmx pid ppid uid usr% sys% wait% chld% size rss pf inblk outblk chario sysc vctx ictx msps find 1 2886 344 9506 0.00 0.00 0.00 0.00 840 668 0.0 0.00 0.00 0 0 0.00 0.00 0.00 amber: stopped amber: process is ^Z stopped or under truss/debug control monpid 2886 nproc 84 newproc 0 deadproc 1 Process has terminated Wrap upI've folded this new code into the next build of SE 3.1 prefcs, although this rule is still at the experimental stage. Please let me know if you find it useful or can think of ways to improve it. I'd like to see rules like this in all the commercial tools, so that the cause of problems can be identified more easily. Resources Recommended reading
Additional resources
Related articles and FAQs in SunWorld
Other SunWorld resources
Adrian Cockcroft joined Sun Microsystems in 1988, and currently works as a performance specialist for the Computer Systems Division of Sun. He wrote Sun Performance and Tuning: SPARC and Solaris and Sun Performance and Tuning: Java and the Internet, both published by Sun Microsystems Press Books. | ||||||||
|
| ||||||||||||