Abstract: This article introduces Contents:
An important feature making the

Note that by application we mean any software other than the Solaris kernel. This includes middleware such as Mozilla, StarOffice software, various components of GNOME, OpenGL graphics, application servers, and others. The article is intended for software engineers working at independent software vendors (ISVs), as well as for system administrators and end users working with Solaris applications.

As an example, consider a large facility with many users running several multi-process and/or multi-tier applications. What steps are taken if one of the application processes crashes?

In the best of situations, the end user will recognize the specific inputs and conditions needed to reproduce the problem, and the application development team will be able to quickly reproduce and fix the problem. Unfortunately, in some cases the user will only be able to correctly specify some of the input and environment details, will incorrectly describe others, and won't be able to specify some of them at all.

First, the software developers will need to discover the basic details needed to start the debugging process. These details may include the traceback (also known as backtrace or stack trace) of the failed application, memory map, environment variables active in the process, the OS version, patches installed, available swap space, and more. Manually gathering this information for every failure is costly, time-consuming, and error-prone, and may be extremely difficult for sporadic cases.

Automatic and semiautomatic solutions exist for some systems (see the section Solutions on Other Systems below), but they may create security and privacy problems because the user has little or no control of the information gathered and forwarded to the application and system vendors. For example, see reference [1].

Core Files: Historically, failure analysis on UNIX systems has been based on collecting and inspecting the core file created when an application crashes. (Note: Application "core dump", "coredump", "core file", and "corefile" all mean the same thing.) However, the ways applications are created and used have changed dramatically since UNIX was created in the early 1970s. Typical application size and memory requirements are orders of magnitude larger than they were then. The users are not the same people as the application programmers; many end users do not know what a debugger is or how to use one. Due to the ubiquity of the Internet, security and privacy are much more serious concerns now than they were then. Currently, the Core File method suffers from several types of problems:
Interpose Libraries: Another technical approach to dealing
with application crashes is to implement an interpose library that
would install its own signal handlers for
Application Signal Handlers: Of course, it is also possible to install signal handlers for

Some ISVs have handled these signals for years and done in their signal handlers whatever works for them. However, this approach does not work for most applications. We need a way to handle application crashes system-wide, without changing any application in any way.
For example, the following sequence invokes an application in the background, with
application_invocation &
pid=$!
truss -t \!all -m\!all -s \!all -S segv,bus -p $pid
if kill -0 $pid ; then
    pmap $pid
    pstack $pid
    prun $pid
fi
The main problem with the

The solution we are proposing in this article uses the new Solaris facility, DTrace, to watch for application crashes and to process each with a user-supplied reporting script. DTrace is a powerful facility introduced in the Solaris 10 OS for kernel and application performance tuning and debugging (see reference [6] and the

Our implementation of this solution consists of the following DTrace script. (Note: Please save file without

app_crash.d

This is combined with a user-defined shell script such as the following template. (Note: Please save file without

runme_on_app_crash

While individual users can use

Once the
The described design using the environment variable

For example, a particular ISV may wish to install DTrace script

Once the users have collected all the necessary information, they can review it, make sure that it contains no sensitive information, and then send it to their application vendor. If desired, they can even automate this emailing as a part of the script (see commented lines at the end of shell script

Note that by itself the collected information may not be adequate for the ISV or Sun engineers to resolve the problem. No one can guarantee that the problem will be solved based on this information only. It is always best for the users to come up with a reproducible test case and provide that to their application software vendor. Nevertheless, the information collected with a script like

If the users see question marks instead of the function names in the traceback, they can send the information to the ISV anyway. The application owners should be able to restore the actual function names, for example using a tool called unstrip_traceback

More possibilities are enabled by
Implementing either or both of the above suggestions would help further reduce the costs of software defects. Note that any automatic system to analyze multiple crash reports will require an agreed-upon standard defining what each report should contain and in what format. This standard can vary for different applications and user sites. The users and ISVs will still have full control over the gathered information. The involved parties will just need to coordinate it.

Running the DTrace scripts like

<user-name>::::defaultpriv=basic,dtrace_proc,dtrace_kernel

File

# ppriv -s A+dtrace_proc,dtrace_kernel PID

where PID is the process ID of the user's shell.

A word of caution is in order. The DTrace privileges described above will allow the use of all facilities of DTrace (including the kernel facilities). Please use these privileges responsibly and be aware that they could permit Denial of Service (DoS) attacks on your systems. Using these privileges for running our DTrace script

The DTrace script

#!/usr/sbin/dtrace -qws

This means this script can be run directly (assuming that it has appropriate execute permissions) and that it will run the

#pragma D option strsize=500

This option instructs DTrace to allow strings up to 500 characters long. The default size of 256 is not big enough for our purposes.
proc:::signal-send
/(args[2] == SIGBUS || args[2] == SIGSEGV) &&
pid == args[1]->pr_pid/
In DTrace terminology, these lines specify the use of the

stop();

This means the process that generated such a signal is stopped until later notice.

system(
    "%s=%d; %s=%d; %s=%d; %s=%s; %s %s %s %s %s %s %s %s %s",
    "CRASH_PID", pid,
    "CRASH_UID", uid,
    "DTRACE_UID", $uid,
    "PROG", execname,
    "SCRIPT=`/bin/pargs -e $CRASH_PID | ",
    " /bin/grep ON_APP_CRASH_INVOKE | /bin/cut -d= -f2`;",
    "[ -z \"$SCRIPT\" -o ! -x \"$SCRIPT\" ] && exit 0;",
    "if [ $DTRACE_UID -eq 0 -a $CRASH_UID -ne 0 ] ; then",
    " USER_NAME=`/bin/getent passwd $CRASH_UID|/bin/cut -d: -f1`;",
    " /bin/su $USER_NAME -c \"$SCRIPT $CRASH_PID $PROG\";",
    "else ",
    " $SCRIPT $CRASH_PID $PROG; ",
    "fi"
);

This long line executes the specified sequence of Bourne shell commands. We could have instead introduced a helper shell script that would be easier to read, but that would complicate installation somewhat, so we chose to use a one-line command. Note that the

The above script performs the following steps:
system("/bin/prun %d", pid);
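The shell commands embedded in the system() call can be easier to follow when written out as a standalone helper. The following is only a sketch of that logic and is not part of app_crash.d: the function names find_crash_script and pick_runner are hypothetical, and the environment listing is read from standard input (in the format printed by pargs -e) so that the logic can be tried without a live process.

```shell
#!/bin/sh
# Sketch only: mirrors the logic of the one-line command in app_crash.d.
# find_crash_script and pick_runner are hypothetical names.

# Read "envp[N]: NAME=value" lines (pargs -e format) on stdin and print
# the value of ON_APP_CRASH_INVOKE, using the same grep/cut pipeline as
# the DTrace script.
find_crash_script() {
    grep ON_APP_CRASH_INVOKE | cut -d= -f2
}

# Decide how the reporting script would be run.
#   $1 = UID running dtrace, $2 = UID of the crashed process
# Prints "su" when root is tracing another user's process (the script
# should then run as that user), otherwise prints "direct".
pick_runner() {
    if [ "$1" -eq 0 ] && [ "$2" -ne 0 ]; then
        echo su
    else
        echo direct
    fi
}

# Example with a made-up environment listing and made-up UIDs.
printf 'envp[0]: HOME=/home/demo\nenvp[1]: ON_APP_CRASH_INVOKE=/home/demo/report\n' |
    find_crash_script
pick_runner 0 1001
```

Keeping the two decisions in separate functions makes each one easy to check by hand, at the cost of the extra installed file that the one-line form avoids.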
The example user-defined script
/bin/pstack $PID
/bin/pmap -x $PID
/bin/pldd $PID
/bin/ptree $PID
/bin/pargs -ace $PID
/bin/plimit -m $PID
/bin/pwdx $PID
/bin/pfiles $PID

The commands specific to the crashing process are based on the Solaris

A possibility exists that some applications use the signals

Also note that one of the advantages of
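A reporting script in this style can be quite small. The sketch below is not the article's runme_on_app_crash. It is a hypothetical illustration of the same pattern: echo each command after a "> " marker, run it, and collect everything in one file named after the program and process ID. The helper names (run_cmd, write_report) and the REPORT_DIR variable are assumptions, and commands that are not installed are noted rather than treated as errors.

```shell
#!/bin/sh
# Hypothetical sketch of a crash-report script; all names are illustrative.

# Echo a command after a "> " marker, then run it if its program exists.
run_cmd() {
    echo "> $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command failed)"
    else
        echo "(command not available on this system)"
    fi
}

# write_report PID PROG
# Collect data into ${REPORT_DIR:-/var/tmp}/appcrash.PROG.PID and print
# the name of the file that was written.
write_report() {
    pid=$1
    prog=$2
    out="${REPORT_DIR:-/var/tmp}/appcrash.$prog.$pid"
    {
        echo "Program: $prog"
        echo "Process ID: $pid"
        echo "Application Debugging Data"
        echo "--------------------------"
        run_cmd /bin/pstack "$pid"
        run_cmd /bin/pmap -x "$pid"
        echo "System Configuration Data"
        echo "-------------------------"
        run_cmd uname -a
    } > "$out"
    echo "$out"
}

# Called with the two arguments that app_crash.d passes to the user
# script: the process ID, then the program name.
if [ $# -eq 2 ]; then
    write_report "$1" "$2"
fi
```

Because every command goes through run_cmd, adding or removing an item from the report is a one-line change, which keeps the user in control of what is collected.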
Consider the following simple test program, which contains a bug: it dereferences a null pointer in subroutine sub2.
% cat test1.c
#include <stdio.h>
#include <stdlib.h>
static void sub2(int *p)
{
    int i;
    i = *p;
}
static void sub(int *p)
{
    sub2(p);
}
int main()
{
    int *p=NULL;
    sub(p);
    return 0;
}
% cc -o test1 test1.c
Step 1 Let us start the app_crash.d script in the background:
% ./app_crash.d &
[1] 5707
Step 2 Now define the necessary environment variable in a different terminal window:
% setenv ON_APP_CRASH_INVOKE $HOME/tests/runme_on_app_crash
Step 3 Execute the test1 program in the terminal window where the environment variable was defined:
% test1
Segmentation Fault
At the time of the crash, the information was collected in the
% ls -lt /var/tmp/ | head -2
total 42
-rw-r--r-- 1 gregns staff 4037 Apr 20 11:30 /var/tmp/appcrash.test1.5174
% cat /var/tmp/appcrash.test1.5174
Output from runme_on_app_crash
Program: test1
Process ID: 5174
Application Debugging Data
--------------------------
> /bin/pstack 5174
5174: test1
08050652 sub2 (0) + 12
08050688 sub (0) + 18
080506bf main (1, 8047cec, 8047cf4) + 1f
080505aa ???????? (1, 8047db0, 0, 8047db6, 8047dc8, 8047e49)
> /bin/pmap -x 5174
5174: test1
Address Kbytes RSS Anon Locked Mode Mapped File
08047000 4 4 4 - rwx-- [ stack ]
08050000 4 4 - - r-x-- test1
08060000 4 4 4 - rwx-- test1
FEEE0000 4 4 4 - rwx-- [ anon ]
FEEF0000 24 12 12 - rwx-- [ anon ]
FEF00000 724 724 - - r-x-- libc.so.1
FEFC5000 24 24 24 - rw--- libc.so.1
FEFCB000 8 8 8 - rw--- libc.so.1
FEFDA000 128 128 - - r-x-- ld.so.1
FEFFA000 4 4 4 - rwx-- ld.so.1
FEFFB000 8 8 8 - rwx-- ld.so.1
-------- ------- ------- ------- -------
total Kb 936 924 68 -
> /bin/pldd 5174
5174: test1
/lib/libc.so.1
> /bin/ptree 5174
225 /usr/lib/inet/inetd start
5139 /usr/sbin/in.rlogind
5141 -csh
5174 test1
> /bin/pargs -ace 5174
5174: test1
argv[0]: test1
envp[0]: HOME=/home/gregns
... [removed more environment variable settings] ...
envp[19]: ON_APP_CRASH_INVOKE=
/home/gregns/tests/runme_on_app_crash
> /bin/plimit -m 5174
5174: test1
resource current maximum
time(seconds) unlimited unlimited
file(mbytes) unlimited unlimited
data(mbytes) unlimited unlimited
stack(mbytes) 10 unlimited
coredump(mbytes) 0 unlimited
nofiles(descriptors) 256 65536
vmemory(mbytes) unlimited unlimited
> /bin/pwdx 5174
5174: /home/gregns/tests
> /bin/pfiles 5174
5174: test1
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
gid:7 rdev:24,4
O_RDWR
/devices/pseudo/pts@0:4
1: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
gid:7 rdev:24,4
O_RDWR
/devices/pseudo/pts@0:4
2: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
gid:7 rdev:24,4
O_RDWR
/devices/pseudo/pts@0:4
System Configuration Data
-------------------------
> /bin/uname -a
SunOS rahova 5.10 Generic i86pc i386 i86pc
> /bin/cat /etc/release
Solaris 10 3/05 s10_74L2a X86
Copyright 2005 Sun Microsystems, Inc.
All Rights Reserved.
Use is subject to license terms.
Assembled 22 January 2005
> /usr/sbin/psrinfo -v
Status of virtual processor 0 as of: 04/20/2005 11:30:49
on-line since 03/30/2005 14:43:48.
The i386 processor operates at 2393 MHz,
and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 04/20/2005 11:30:49
on-line since 03/30/2005 14:43:53.
The i386 processor operates at 2393 MHz,
and has an i387 compatible floating point processor.
> /usr/sbin/swap -s
total: 62464k bytes allocated + 12248k reserved =
74712k used, 6891300k available
> /usr/sbin/swap -l
swapfile dev swaplo blocks free
/dev/dsk/c1t0d0s1 28,65 8 8389432 8389432
> /usr/sbin/prtconf|/bin/head -2
System Configuration: Sun Microsystems i86pc
Memory size: 3327 Megabytes
> /bin/showrev -p|/bin/cut -d' ' -f2|/bin/sort
116299-08
116303-02
The above file is ready to be sent to the owner of the faulty
application for debugging. Note that the traceback produced by
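Reports in this layout are regular enough to post-process. As a rough sketch (the function name top_frames is hypothetical, and the pattern assumes the pstack line format seen in the sample report: a hexadecimal address, a function name, then an argument list in parentheses), the following pulls the function names out of a traceback so that similar crashes can be grouped:

```shell
#!/bin/sh
# Sketch: print the function names from pstack-style lines on stdin,
# innermost frame first. Frames shown as question marks are kept as-is.
# Lines that do not look like stack frames (for example pmap rows) are
# ignored because their third field does not start with "(".
top_frames() {
    awk '$1 ~ /^[0-9a-fA-F]+$/ && $3 ~ /^\(/ { print $2 }'
}

# Example using lines taken from the sample report.
top_frames <<'EOF'
5174: test1
08050652 sub2 (0) + 12
08050688 sub (0) + 18
080506bf main (1, 8047cec, 8047cf4) + 1f
EOF
```

For these sample lines the function prints sub2, sub, and main, one per line. A real grouping tool would need an agreed report format, as noted earlier in the article.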
Microsoft Windows has interesting functionality in this area: see reference [9], Windows Error Reporting for Developers. Not only does Microsoft provide the infrastructure for the ISVs to automatically collect the crash data (which is what

Microsoft encrypts the collected data such that only the intended ISV or Microsoft employees can decrypt it. This is not a bad idea, but that method still doesn't allow the users to inspect the data before allowing it to leave their sites. Nor does it let the users control what information to collect.

For more information on how Microsoft collects the data on crashes, see reference [10], Microsoft Online Crash Analysis Data Collection Policy. For further information about Microsoft minidumps, see reference [11], Post-Mortem Debugging Your Application with Minidumps and Visual Studio .NET. We think the

Apple Mac OS X also appears to have impressive capabilities in this area, although we haven't tested them. For details, see reference [12], Mac OS X CrashReporter.

Specialized commercial products and services are available to perform automated crash monitoring and analysis for applications. For one example, see reference [13]. Related discussions are also available in reference [1] and reference [3].

This article describes a DTrace-based solution allowing ISVs and users of the Solaris OS to safely collect debugging information when any application crashes, and thus help improve the quality of the applications and reduce the costs of software defects. The users can fully automate such diagnostic data collection and transmission if they want, while having full control over which information is collected and sent to the application developer and/or system vendor for analysis and remediation.

For AppCrash updates and related discussions, see reference [14].
Greg Nakhimovsky and Morgan Herrington are Sun engineers working with application software vendors to make sure their products run well on Sun systems.