By Jamie Wilson, June 2003
Understanding and Preventing Systems Slowdowns
Summary
Has increased demand caused a shortage of resources on your server? Are customers complaining about
slow response times? In these days of exponential network growth, keeping up with demand can be a difficult
challenge. Jamie Wilson explains what you can do to analyze your current resource demands, and gives tips on
planning for future growth. (2,100 words)
It's a phone call most administrators never want to receive. "The server is slow, no one can check email.
Web pages are loading slowly, or not at all!" Too often administrators find themselves trying to climb up the
steep slope of increased demand. As a user base grows, the demand placed on the server grows as well. This growth
may be linear and predictable, or it may be completely random or exponential.
There are ways to avoid the angry phone call altogether. Understanding system bottlenecks and gathering statistical data can
help you project your system's current and future needs. This can eliminate user complaints -- and prevent that phone from ringing.
What causes a bottleneck?
Why does a system slow down in the first place? Slowdowns can usually be attributed to one or more bottlenecks,
which are caused when part of the system is not running fast enough to keep up with the demands placed on it.
The most common bottlenecks occur for the following reasons:
- Slow disks or disk arrays aren't able to handle I/O requests quickly enough
- The system is starved for memory, so applications are forced to swap to disk, which can slow response
- The system is out of processor power
- The network interface is overloaded
So how can you tell which of these subsystems is having a problem? By using the various tools of the capacity
planning trade: sar, netstat, lockstat, and top.
sar
sar is by far one of the most valuable tools an administrator has to track past trends and predict
future demand. sar is only installed by default with the full distribution of Solaris. Verify that
sar is installed on your system:
pkginfo -l SUNWaccu
If it's not currently installed, you can add it by installing SUNWaccu.
Once sar is installed, you'll need to configure it to begin collecting data. First, edit the system's
crontab:
crontab -e sys
Remove the comments so that you have these lines:
0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
Then edit /etc/init.d/perf (vi /etc/init.d/perf) and remove the comment characters from the lines
that follow the text "Uncomment the following lines".
This will enable sar for system-activity reporting. You may also want to increase sar's log retention:
vi /usr/lib/sa/sa2
/usr/bin/find /var/adm/sa \( -name 'sar*' -o -name 'sa*' \) -mtime +30 -exec /usr/bin/rm {} \;
Your system will now begin gathering data. For a detailed explanation of how to use sar, please see the
sar man pages. Here is a quick list of sar's more useful features:
- sar run with no options shows CPU usage
- sar -q shows your average queue size
- sar -p and sar -g show paging activity
- sar -d shows disk utilization
- sar -f reads a previously saved file, for example sar -f /var/adm/sa/sa03
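If you want to post-process sar's CPU report rather than read it by eye, a small parser is enough. The following is a minimal sketch, not a definitive tool: the sample text is invented for illustration and assumes the usual column layout (%usr, %sys, %wio, %idle, with a closing Average line), and the function name parse_sar_cpu is hypothetical.

```python
# Invented sample of `sar` CPU output; real reports may differ in spacing
# and header details, so treat this layout as an assumption.
SAMPLE_SAR_OUTPUT = """\
SunOS host1 5.8 Generic sun4u    06/03/2003

00:00:00    %usr    %sys    %wio   %idle
08:00:00      12       5       3      80
08:20:00      35      10      15      40
08:40:00      20       5       5      70
Average       22       7       8      63
"""


def parse_sar_cpu(text):
    """Return (samples, average) parsed from a sar CPU report.

    samples is a list of dicts keyed by column name plus "time";
    average is the dict for the "Average" line, or None if absent.
    """
    columns = None
    samples = []
    average = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "%idle" in fields:
            # Header row: the first field is the timestamp of the header.
            columns = fields[1:]
            continue
        if columns is None or len(fields) != len(columns) + 1:
            continue  # banner line or anything else we do not understand
        try:
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue
        row = dict(zip(columns, values))
        if fields[0] == "Average":
            average = row
        else:
            row["time"] = fields[0]
            samples.append(row)
    return samples, average


samples, average = parse_sar_cpu(SAMPLE_SAR_OUTPUT)
busiest = min(samples, key=lambda row: row["%idle"])
print("average idle:", average["%idle"], "busiest interval:", busiest["time"])
```

In practice you would feed the function the captured output of sar or sar -f instead of the sample string.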
netstat
One of sar's shortcomings is that it will not trend network traffic for you. This can be done using
netstat. netstat -in will show you your network interfaces, how much traffic they have
passed since booting, and any problems with them.
netstat -in
Name  Mtu   Net/Dest       Address        Ipkts       Ierrs   Opkts       Oerrs  Collis  Queue
hme1  1500  192.168.100.0  192.168.100.1  1477758588  0       2897473608  0      0       0
hme2  1500  192.168.101.0  192.168.101.1  3228181693  157415  3365694030  0      0       0
From this example, you can see that hme1 and hme2 are very busy, with hme2
having seen some incoming errors on its interface.
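To put the error counters in proportion, you can divide them by the packet counts. The sketch below does that for the netstat -in output shown above; the function name interface_error_rates is hypothetical, and the parsing assumes every data row has one value per header column.

```python
# The netstat -in output from the example above, one interface per line.
NETSTAT_OUTPUT = """\
Name  Mtu   Net/Dest       Address        Ipkts       Ierrs   Opkts       Oerrs  Collis  Queue
hme1  1500  192.168.100.0  192.168.100.1  1477758588  0       2897473608  0      0       0
hme2  1500  192.168.101.0  192.168.101.1  3228181693  157415  3365694030  0      0       0
"""


def interface_error_rates(text):
    """Map interface name -> {"in": Ierrs/Ipkts, "out": Oerrs/Opkts}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    rates = {}
    for fields in rows:
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        row = dict(zip(header, fields))
        ipkts = int(row["Ipkts"])
        opkts = int(row["Opkts"])
        rates[row["Name"]] = {
            "in": int(row["Ierrs"]) / ipkts if ipkts else 0.0,
            "out": int(row["Oerrs"]) / opkts if opkts else 0.0,
        }
    return rates


rates = interface_error_rates(NETSTAT_OUTPUT)
for name, rate in sorted(rates.items()):
    print(f"{name}: in {rate['in']:.4%}, out {rate['out']:.4%}")
```

Because these counters accumulate from boot, a trend would need two snapshots taken some time apart and the difference between them; the ratio above only describes the whole period since boot.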
lockstat
With Solaris 2.6 and up, Sun included a utility called lockstat, which can show you what is causing
kernel locking. The lockstat man pages are available for more information. Here is one example of how
to use this utility:
lockstat sleep 30 > /tmp/lock.out
more /tmp/lock.out
Callers with the most lock counts may be causing problems. If you see hmestart or qfestart
causing many kernel locks, you may need to add another network interface.
top
top is not installed with Solaris, but it is an invaluable tool that offers a realtime snapshot of what's
happening on the system; you can download it from http://www.sunfreeware.com.
top will show you how much memory is free on the system, and which processes are using the most CPU or
memory resources.
So where's the slowdown?
Using tools such as sar, netstat, and lockstat can help you determine where a
slowdown might be happening, or where one is about to happen. Here are some examples of how you can
use these tools:
- sar with no options. This shows how idle the CPUs are. If your CPUs spend most of their time in %usr
or %sys, you may have to add CPUs to deal with increased demand. If %wio is high, your system is waiting
for your I/O subsystems to catch up. You may have a slow disk or array.
- sar -g. If pgscan/s is consistently high, the page scanner is working to free memory and your system
is swapping. No swapping is the only good swapping. Your system is probably short on memory. Use sar -r to
verify this.
- netstat -in. Look to see if an interface is overloaded with traffic. If so, you may have to add
another physical interface. Also, look at Ierrs, Oerrs, and Collis. These should all be relatively low numbers,
if not zero. High numbers in these columns can indicate network problems, such as speed or duplex autonegotiation
issues, bad cabling, or a bad switch port.
- top. If all else fails, look at top. Which process is taking up the most resources?
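The checks above can be folded into a first-pass triage function. This is only a sketch: the function name and all three thresholds are assumptions chosen for illustration, not values from sar's documentation, and you should tune them to what is normal on your own systems.

```python
def likely_bottlenecks(usr, sys_pct, wio, pgscan_per_sec,
                       cpu_busy_limit=80.0,   # hypothetical threshold
                       wio_limit=20.0,        # hypothetical threshold
                       pgscan_limit=200.0):   # hypothetical threshold
    """Return a list of suspected bottlenecks from averaged sar figures.

    usr, sys_pct and wio are percentages from the CPU report;
    pgscan_per_sec is the pgscan/s figure from sar -g.
    """
    findings = []
    if usr + sys_pct >= cpu_busy_limit:
        findings.append("cpu")        # CPUs are busy doing real work
    if wio >= wio_limit:
        findings.append("disk I/O")   # CPUs are waiting on I/O
    if pgscan_per_sec >= pgscan_limit:
        findings.append("memory")     # page scanner is hunting for memory
    return findings


print(likely_bottlenecks(usr=10, sys_pct=5, wio=40, pgscan_per_sec=0))
print(likely_bottlenecks(usr=60, sys_pct=30, wio=2, pgscan_per_sec=500))
```

A result with more than one entry is a reminder that bottlenecks often overlap, so the output is a list of places to look rather than a diagnosis.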
Analyze the data and make recommendations
So you've put together all of your reporting tools. You're able to do past trend analysis and future growth predictions
based on sar. You can also do realtime snapshots using top. What should you do to make the
system perform better now, as well as in the future?
It's very important to note that if you do identify and solve a bottleneck, your solution can potentially cause even
worse problems. For example, if you have idle CPU and a busy disk, replacing the busy disk with a fast disk can cause
the CPU usage to spike. Remember, capacity planning is a constant exercise, not a one-time activity. Here are some
scenarios:
- Busy I/O subsystems. Say you've determined by using sar -d that one or more of your disks is very busy
(more than 90 percent busy). Either move I/O from that disk to a faster disk or array, or split up the I/O
among several arrays, depending on the data. Remember that SCSI interfaces can be overloaded as well. This is
difficult to determine, but it's a good idea to add new SCSI interfaces and balance I/O traffic accordingly.
Improving I/O access can have a major impact on CPU or network performance.
- Busy CPUs. Using sar, it may become apparent that your system is spending most of its time in %usr and
%sys. You may also want to use mpstat to see more information about your CPUs. Adding CPUs in this situation can
help, but it may not solve the problem. A poorly written application can consume as much CPU as you give it.
- Busy network. netstat -in and lockstat may show your network interface to be very busy. Add another
physical interface, but beware of increased I/O and CPU demands.
- Is the system swapping? Add more memory. Do whatever you can to prevent the system from swapping. If
possible, create swap on fast disks.
Application slowdowns
Sometimes system hardware isn't the problem at all. Remember that applications are what consume system resources,
and poorly written applications can be very difficult to deal with. Here are some bits of advice:
- Beware of single-threaded applications. While a single-threaded application is generally easier
to develop, it's also more costly to run. Many applications developed in-house are single-threaded. The worst example
is the single-threaded nonforking application. This is an application that's not only single-threaded, but also won't
fork copies of itself to consume resources more efficiently. top will only show one instance of this
daemon running. ps -eLf will only show one thread. This can be a very challenging application, as it may
only consume a single CPU even if you add more CPUs. Single-threaded applications that fork copies of themselves are
much easier to deal with, but still are not as efficient as a multithreaded application.
- Learn as much as possible about the application you're dealing with. Talk to the vendors or the
authors, because they'll know what tricks and tips will work best. Often, entries need to be made in /etc/system
so that an application can work at peak capacity. ndd settings may also need to be tweaked based on your
current needs. Consider all of these performance suggestions before adding new hardware.
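The single-CPU limit is easy to miss because the machine as a whole can look mostly idle. As a rough illustration, the sketch below computes the largest share of the whole machine that one single-threaded process can use, assuming your monitoring tool reports a process's CPU as a percentage of all CPUs combined. Both function names and the tolerance value are hypothetical.

```python
def single_thread_ceiling(cpu_count):
    """Largest percentage of total CPU one single-threaded process can use."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return 100.0 / cpu_count


def looks_cpu_bound(process_cpu_percent, cpu_count, tolerance=2.0):
    """True if the process sits at (or within tolerance of) its ceiling.

    tolerance is an assumed allowance, in percentage points, for
    measurement noise.
    """
    return process_cpu_percent >= single_thread_ceiling(cpu_count) - tolerance


# On a four-CPU machine, a single-threaded daemon at 24.5 percent of total
# CPU is effectively saturated even though the system looks 75 percent idle.
print(single_thread_ceiling(4), looks_cpu_bound(24.5, 4))
```

If a process is pinned at its ceiling, adding CPUs will not help it; faster CPUs or a change to the application will.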
Planning for future capacity
Sometimes the best way to plan for the future is to look at your past performance data. Using sar,
you can ascertain a trend in the resource consumption on your system. If your system CPU was 90 percent idle three
months ago, and now it's 80 percent idle, it's not unreasonable to assume that in three months your system will only
have 70 percent idle CPU. Some parts of your system may grow at exponential rates, such as I/O or network subsystems.
That's why it's important to constantly gather data, so you can see where you've been and where you're going. You may
also want to consider writing scripts that can monitor sar and alert you when certain thresholds are
reached. If your I/O is 70 percent busy for more than a week, it's probably time to consider a replacement or an
upgrade.
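The projection and the threshold check described above can both be sketched in a few lines. This is a minimal illustration under stated assumptions: it fits a straight line by least squares, which only suits growth that is roughly linear, and the function names and the one-reading-per-day input format are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(points, x):
    """Value predicted by the fitted line at position x."""
    slope, intercept = fit_line(points)
    return slope * x + intercept


def sustained_above(daily_values, threshold, min_days):
    """True if the values stay at or above threshold for min_days in a row."""
    run = 0
    for value in daily_values:
        run = run + 1 if value >= threshold else 0
        if run >= min_days:
            return True
    return False


# Months relative to today: 90 percent idle three months ago, 80 percent now.
idle_history = [(-3, 90.0), (0, 80.0)]
print(round(project(idle_history, 3), 1))   # idle CPU expected in three months

# One %busy reading per day for a disk, checked against a 70 percent limit.
print(sustained_above([72, 75, 71, 80, 78, 74, 73], threshold=70, min_days=7))
```

With more than two samples the fit averages out day-to-day noise. For subsystems growing exponentially, a straight line will underestimate future demand, so treat the projection as a lower bound.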
Communication within your own organization can help you meet future capacity as well. You need to know if your marketing
department is planning a big push to acquire more customers, or if a new accounting system is going into place next week.
Growth is then predictable, as you can plan for increased access to your database or for exponential growth in your
Web server's traffic. Knowing how your customers will be using your servers will help you provide better
performance.
Scaling horizontally and vertically
For large-scale applications, it's extremely important to be able to scale your systems both horizontally and vertically.
Horizontal scaling allows you to add many boxes to serve the same application, while vertical scaling allows you to break
the application into pieces so that each one can be scaled horizontally. A system designed to be both horizontally and
vertically scalable allows you to add servers as demand increases. This way, you avoid the pitfalls of trying to scale
one big box, and can benefit from having many small boxes.
Here are some examples of horizontal and vertical scaling:
- Horizontal Web servers. Multiple Web servers are set up serving identical content, using independent
hardware on different networks. DNS round robin or load balancing can be used.
- Horizontal and vertical email solutions. Each component of the email server (mx, SMTP, POP, Web mail)
can be run on its own independent server. Multiple individual servers can be set up to balance the load. In this way,
you can have four mx servers, two SMTP servers, two POP servers, and one Web mail server, or whatever configuration
you need to meet demand.
- Horizontal and vertical Web servers. Multiple Web servers can be set up -- some that serve
graphics, and others that serve just CGI scripts. Servers can be added as demand increases.
Staying ahead of the curve
Using reporting tools such as sar makes it possible to identify trends on your system. Learning
about the applications on your system and communicating with your organization can also help when planning future
growth. Finally, designing a system that can scale both horizontally and vertically can help you stay one step ahead
of the growth curve.
Resources
Reprinted with permission from the December 2000 edition of Unix Insider.