Summary: Many of us have had to deal with application crashes that are hard or impossible to reproduce, especially at software developers' sites. Yet it can be difficult to fix the problem without reproducing it. This feature describes a few tools to help you generate a traceback, a chain of all function calls that the application was executing at the time of the crash. A traceback will help locate the trouble or at least narrow it down. Later you can also collect crash statistics and use them to enhance the quality-control procedures.

When a Solaris application crashes, it usually produces a core file, which is a disk copy of the application's memory at the time of the crash.

One way to generate a traceback is to use a debugger, such as dbx:

    % dbx /path/executable core
    (dbx) where > traceback.txt
    (dbx) quit
In addition, starting with the Solaris 8 operating environment, the
If you can't use
Handling application core files
Generating a Traceback from the Application

Note that generating a traceback on crash provides only one more piece of the puzzle in determining why the application has crashed and how to fix the problem. Nevertheless, such a traceback may lead to the underlying problem causing the crash, or at the very least to determining whether the problem is in the application or in the system. In any case, getting a traceback is a step in the right direction.

The article "Debugging and Performance Tuning with Library Interposers" describes how to build library interposers and use them for various debugging and performance tuning tasks. Some of the tools described in this article are additional applications of the library interposition technology.
If the application does not have signal handlers for SIGSEGV and SIGBUS installed, the following simple library interposer
You can determine whether the application has those signal handlers installed or not using psig:

    % psig 10630 | egrep ":|^SEGV|^BUS"
    10630: dtpad /tmp/test1.c
    BUS default
    SEGV default
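That tells us that both signals still have the default disposition, so the application has no handlers installed for them.

If the application does have a signal handler installed for the signal causing the crash, see the next section of this article.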
Note: If you use Netscape Navigator as your browser, you can download binary files such as
In that interposer, I install signal handlers for SIGSEGV and SIGBUS and then call
You can add any other fatal signal to the list specified in
Also note that
I built the

    % setenv LD_PRELOAD /full_path/produce_traceback.so
    [run the application here]

If you have access to a Sun compiler and want to build the interposer library yourself, you can do it this way:

    % cc -o produce_traceback.so -G -Kpic produce_traceback.c
Here is an example of using the interposer. I'm artificially sending signal 10 (SIGBUS on Solaris) to the process:

    % setenv LD_PRELOAD ./produce_traceback.so
    % /opt/netscape/netscape &
    [1] 28966
    % kill -10 28966
    Processing signal 10
    28966: /opt/netscape/netscape
    ef2b927c waitid (0, 7129, ef2013e0, 103)
    ef2d4254 _libc_waitpid (7129, ef2014c8, 100, 0, ef323180, ef2e8f48) + 54
    ef2e8f48 system (ef2016a0, e16c28, ef325b30, ef323180, 0, 0) + 1f4
    ef7a07a8 handle_crash (a, 0, ef2017e0, 0, 0, 0) + 60
    ef2b8a0c sigacthandler (a, 0, ef2017e0, 20, 0, 200) + 28
    ef2ccc74 select (ef201b18, ef3260fc, ef3260fc, ef326100, ef326100, 9) + 280
    008f6e00 _OS_SELECT (9, ef203d70, 0, 0, ef203c68, 8f7bbc) + 14
    008f7bf8 _PR_PauseForIO (0, 8, ffffffff, ffffffff, 0, 0) + 4a0
    008f7e00 _PR_Idle (0, 0, 0, 0, 0, 0) + 20
    008f6070 HopToad (8f7de0, 0, 0, e44830, 0, 0) + 14
    008f60a8 HopToadNoArgs (1, 0, 0, 0, 0, 0) + 20
    00000000 ???????? (0, 0, 0, 0, 0, 0)
    A copy of the traceback is stored in /var/tmp/traceback.txt.
    [1] Exit 1 /opt/netscape/netscape

Note that the Netscape executable is stripped:

    % file /opt/netscape/netscape
    /opt/netscape/netscape: ELF 32-bit MSB executable SPARC Version 1, dynamically linked, stripped

See "Dealing with Hidden Function Names" below for a discussion of stripped executables.
If the Application Has Signal Handlers for SIGSEGV and SIGBUS

There are two ways to circumvent that problem:
Dealing with Hidden Function Names
For the stripped executable's symbols that
In addition to stripping, some applications hide their function names in another way. They use a linker mapfile (
When pstack can't determine the function name, it prints ???????? in its place, as the tracebacks in this article show.

To solve that problem, I've written a utility in Perl that recovers the lost function names when you or the software developer have access to the unstripped version of the same executable as the one that crashed. Here is the Perl source code of that utility.
When the application crashes, you can produce a traceback, either with the library interposer described above or from the core file, and send the result to the appropriate software development or support organization. They can use the unstrip_traceback utility as follows:

    % unstrip_traceback traceback.txt unstripped_executable
Internally, the
Here is an example of using unstrip_traceback, with output from test1.c:

    % cc -o test1 test1.c
    % cp test1 test1_unstripped
    % strip test1
    % test1 | tee traceback.txt
    503: test1
    ff31a5ac waitid (0, 1f9, ffbeece0, 103)
    ff2d4d88 _waitpid (0, ffbeedc8, 100, ffbeedc8, 23208, ff310120) + 60
    ff310134 system (ffbeef98, 10a28, 1f7, 0, 0, 0) + 204
    0001089c handler (b, 0, ffbef098, ff338000, 0, 0) + 2c
    ff319834 sigacthandler (b, 0, ffbef098, 0, 0, 0) + 28
    --- called from signal handler with signal 11 (SIGSEGV) ---
    000108d4 ???????? (0, b, ffbef3f8, ffbef3d8, 231e0, ff2ccf88)
    00010900 ???????? (0, 10870, b, 5, 23664, ff29b68c)
    0001093c main (1, ffbef4e4, ffbef4ec, 20800, 0, 0) + 1c
    00010848 _start (0, 0, 0, 0, 0, 0) + 108
    % unstrip_traceback traceback.txt test1_unstripped
    This is a pstack traceback
    Running nm for test1_unstripped ...
    Searching for the functions missing from traceback ...
    503: test1
    ff31a5ac waitid (0, 1f9, ffbeece0, 103)
    ff2d4d88 _waitpid (0, ffbeedc8, 100, ffbeedc8, 23208, ff310120) + 60
    ff310134 system (ffbeef98, 10a28, 1f7, 0, 0, 0) + 204
    0001089c handler (b, 0, ffbef098, ff338000, 0, 0) + 2c
    ff319834 sigacthandler (b, 0, ffbef098, 0, 0, 0) + 28
    --- called from signal handler with signal 11 (SIGSEGV) ---
    000108d4 sub2 (0, b, ffbef3f8, ffbef3d8, 231e0, ff2ccf88) + 0xc
    00010900 sub (0, 10870, b, 5, 23664, ff29b68c) + 0x8
    0001093c main (1, ffbef4e4, ffbef4ec, 20800, 0, 0) + 1c
    00010848 _start (0, 0, 0, 0, 0, 0) + 108
    %
As you can see,
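The recovered lines in the example (`sub2 + 0xc`, `sub + 0x8`) are consistent with a nearest-preceding-symbol lookup: find the function whose start address is the greatest one not exceeding the frame address, and report the difference as the offset. The actual utility is written in Perl and its internals are not reproduced here, so the following C sketch is only an assumption about the general technique. The `struct symbol` type and the `resolve_address` function are hypothetical names.

```c
#include <stddef.h>

/* One entry of a symbol table, as might be parsed from nm output. */
struct symbol {
    unsigned long start;   /* address where the function begins */
    const char   *name;
};

/* Finds the symbol with the greatest start address <= addr in a table
 * sorted by ascending start address. Returns NULL if addr precedes
 * every symbol; otherwise stores addr - start in *offset. */
const struct symbol *resolve_address(const struct symbol *table, size_t count,
                                     unsigned long addr, unsigned long *offset)
{
    size_t lo = 0, hi = count;   /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].start <= addr)
            lo = mid + 1;        /* entry qualifies; look further right */
        else
            hi = mid;            /* entry starts after addr; look left */
    }
    if (lo == 0)
        return NULL;             /* addr lies below the first symbol */
    *offset = addr - table[lo - 1].start;
    return &table[lo - 1];
}
```

Working backward from the offsets in the example, sub2 would start at 0x108c8 and sub at 0x108f8, so looking up 0x108d4 in such a table yields sub2 with offset 0xc. A real tool would also need to check the symbol's size, since an address past the end of the nearest function does not belong to it.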
Working Around Possible Complications
Also, strictly speaking, In any case, it's easy enough to circumvent those potential problems. For example:
    /* a file in /var/tmp will survive a reboot; one in /tmp will not */
    /*          0 2345678 1 2345678 2 2345678 3 */
    char buf[]="/usr/proc/bin/pstack           |/bin/tee /var/tmp/traceback.txt\n";
    char cdigits[]="0123456789";
    int n, i;

    n = (int)getpid();
    i = 30; /* index of the last character of the blank PID field in buf */
    while ( n != 0 ) /* keep dividing by 10, storing digits right to left */
    {
        buf[i--] = cdigits[n%10];
        n /= 10;
    }
    (void)putenv("LD_PRELOAD="); /* keep the interposer out of the child processes */
    system(buf);
    /* write(2) is safe in a signal handler, while printf(3S) is not */
    /*       0 2345678 1 2345678 2 2345678 3 2345678 4 2345678 5 2345678 6 */
    write(1,"A copy of the traceback is stored in /var/tmp/traceback.txt\n",60);
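The digit-by-digit conversion in the fragment above can be factored into a small helper that also guards against overrunning the blank field. This is a sketch under stated assumptions rather than part of the original tool; the name `fill_pid_field` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Writes the decimal digits of pid right-aligned into buf[first..last],
 * overwriting the blanks reserved in a command template. Uses only
 * arithmetic and array stores, so it avoids the stdio calls that are
 * unsafe in a signal handler. Returns 0 on success, or -1 if pid is
 * not positive or the field is too narrow for all of its digits. */
int fill_pid_field(char *buf, size_t first, size_t last, long pid)
{
    size_t i = last;

    if (pid <= 0)
        return -1;
    for (;;) {
        buf[i] = (char)('0' + pid % 10);   /* store the lowest digit */
        pid /= 10;
        if (pid == 0)
            return 0;                      /* all digits written */
        if (i == first)
            return -1;                     /* digits left over but no room */
        i--;
    }
}
```

It would be called with the first and last index of the blank field reserved for the PID in the command template, and with `(long)getpid()` as the last argument. Returning an error when the field is too narrow keeps the helper from overwriting the command text to the left of the field.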
One more possible problem with the
In the worst case, even if a deadlock occurs, the application will hang. You can still determine the process ID (using
About the author

Greg Nakhimovsky is a member of the technical staff at Sun Microsystems, working with independent software vendors to make sure their applications run well on Sun systems. He has 20 years of industry experience developing, performance tuning, and troubleshooting technical computer applications.

July 2001