Eliminating Nonreentrant Library Calls in Multithreaded Programs

 
By Bruce Chapman, February 2002  

Introduction

For a variety of reasons, nonreentrant library calls are often made from multithreaded programs. The resulting problems are frequently not caught until the code is placed under heavy load on a multiprocessor machine. To make matters worse, calls such as ctime(3C) can corrupt the process's heap and cause crashes in unrelated areas. This paper provides a list of APIs to avoid, as well as a means of checking that your code is not harboring dangerous, nonreentrant library calls.

Problem Explanation

With the Solaris Operating Environment (OE), to avoid making the mistake of using nonreentrant library calls, you must look at the man page for every library call your code makes. If a reentrant <name>_r version of the call exists, use it; that is the only safe call to make in a multithreaded (MT) program. Unfortunately, not all engineers have the discipline to make this approach effective.
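To make the difference between a nonreentrant call and its <name>_r counterpart concrete, here is a minimal sketch. Both functions, label_unsafe() and label_r(), are hypothetical and exist only for illustration; they are not part of any Solaris library. The first returns a pointer into a single static buffer shared by every caller, while the second writes into storage supplied by the caller.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical nonreentrant formatter: every caller shares one static buffer. */
char *label_unsafe(int id)
{
    static char buf[32];

    snprintf(buf, sizeof buf, "worker-%d", id);
    return buf;                 /* the next call overwrites this storage */
}

/* Hypothetical reentrant counterpart: the caller owns the storage. */
char *label_r(int id, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "worker-%d", id);
    return buf;
}
```

Even in a single thread, two calls to label_unsafe() return the same pointer, so the first result is lost. With several threads, the overwrite can happen while another thread is still reading or writing the buffer.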

This kind of insidious bug can creep into your MT code in many other ways as well. One example of this is an MT server that allows you to run code written by others (like a plug-in) in an MT environment. Another involves code that is ported from a platform that protects you from this type of bug (at a small cost to performance). There is also the case of code that was written in a single-threaded fashion, but later ported to an MT environment.

In extreme cases, you could accomplish 100 percent code coverage during QA, and not run up against one of these bugs until production. To illustrate this, please consider the trivial MT program that follows, mtunsafe.c, which creates several threads, each of which loops while incrementing a global counter and calling time()/ctime().

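#include <poll.h>    /* poll(), used as a millisecond sleep */
#include <stdio.h> 
#include <stdlib.h>  /* atoi(), exit() */
#include <thread.h> 
#include <time.h> 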

#define MAXTHREADS 10000 

static int Threads; 
static volatile int StartYet; 
static volatile int GlobCount = 0; 
 

#define THR_MSLEEP(millisecs) poll(NULL,0L,(int)millisecs) 

void *test_thread(void *nthr) { 
  time_t t; 
  char *ct; 

  while (!StartYet) THR_MSLEEP(500); 

  while(1) { 
    t = time(NULL); 
    ct = ctime(&t); 
    GlobCount++; /* this too is unsafe, but we don't care too much! */ 
  } 
} 

int main(int argc, char **argv) { 
  int nthr; 
  thread_t threads[MAXTHREADS]; 

  if (argc < 2) { 
    printf("usage : %s <numthreads>\n",argv[0]); 
    exit(1); 
  } 

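  Threads = nthr = atoi(argv[1]); 
  if (nthr < 1 || nthr > MAXTHREADS) { /* keep within threads[] */ 
    printf("numthreads must be between 1 and %d\n",MAXTHREADS); 
    exit(1); 
  } 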

  StartYet = 0; 
  while ( nthr -- ) { 
    THR_MSLEEP(10); /* stagger creation */ 
    thr_create(NULL, 0, test_thread, (void *)(long) nthr, 
               THR_NEW_LWP , &threads[nthr]); 
  } 

  while (1) { 
    poll(NULL,0L,300); 
    printf("main looping...total count so far %d\n",GlobCount); 
    StartYet = 1; 
    if (GlobCount > 100000) { 
      exit(0); 
    } 
  } 
}

This program has been run with up to 1600 threads on a single-processor machine with no problems. It also ran fine with 100 threads on a 4-CPU machine. Only when run with 200 threads on a 4-CPU machine did the nonreentrant call to ctime() finally cause a crash. Because ctime() corrupts the C heap, the crash could just as easily have occurred elsewhere, in any part of a program that uses the C heap. This is often what happens, and engineers can spend weeks trying to track down the bug.

Here's a log of execution on a 4-CPU machine:

gyruss 81 =>cc -mt mtunsafe.c 
gyruss 82 =>a.out 100 
main looping...total count so far 0 
main looping...total count so far 8711 
main looping...total count so far 15906 
main looping...total count so far 24577 
main looping...total count so far 31760 
main looping...total count so far 40420 
main looping...total count so far 48998 
main looping...total count so far 57670 
main looping...total count so far 66342 
main looping...total count so far 75018 
main looping...total count so far 83688 
main looping...total count so far 92340 
main looping...total count so far 101021 
gyruss 83 =>a.out 200 
main looping...total count so far 0 
main looping...total count so far 3348 
main looping...total count so far 11882 
main looping...total count so far 18846 
main looping...total count so far 32364 
main looping...total count so far 40087 
main looping...total count so far 49300 
Segmentation fault (core dumped) 
gyruss 84 =>dbx - core 
dbx: Using "/tmp/a.out" 
Reading a.out 
core file header read successfully 
Reading ld.so.1 
Reading libthread.so.1 
Reading libc.so.1 
Reading libdl.so.1 
Reading libc_psr.so.1 
detected a multithreaded program 
t@135 (l@135) terminated by signal SEGV 
  (no mapping at the fault address) 
0xff2c116c: _smalloc+0x008c: ld [%o1 + 0x8], %o0 
(/net/woornack/files2/forte6u2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) where 
current thread: t@135 
=>[1] _smalloc(0x10, 0xff33e728, 0x4, 0x10, 0x0, 
  0x0), at 0xff2c116c 
[2] malloc(0xb, 0xfffffff9, 0xffffffff, 0xff2d192c, 
  0x81010100, 0xff00), at 0xff2c11ac 
[3] tzcpy(0x25a40, 0xff33e8ac, 0x0, 0xa, 0xff338000, 
  0xffbefd53), at 0xff2d1948 
[4] getzname(0xffbefd5d, 0xff33b524, 0x0, 0xff33b524, 
  0xffbefd53, 0x0), at 0xff2d1890 
[5] _ltzset_u(0x3c5f0a23, 0xff338000, 0x0, 0x0, 0x0, 0x1), 
  at 0xff2d1394 
[6] localtime_u(0xf2b05d10, 0xff33e8b4, 0x0, 0x0, 
  0xff338000, 0xff2b731c), at 0xff2d055c 
[7] ctime(0xf2b05d10, 0xff33e8b4, 0x0, 0x0, 0x80ccc, 
  0x107e4), at 0xff2b731c 
[8] test_thread(0x44, 0xfd9d3d18, 0x1, 0xff39ae04, 0x0, 
  0xfe400000), at 0x107e4 
(/net/woornack/files2/forte6u2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) quit

Different Approaches

So how do you eliminate this type of problem? The tedious way is performing source code analysis by referring back to library call man pages for the Solaris OE. Another approach is to use Solaris software tools to look at all the libraries a binary uses. You can do this statically, or with a running process. As many binaries dynamically load libraries, the latter is more likely to be complete. In the early stages of code development and testing, these problems may not yet have manifested themselves. Before deployment, you can do some basic checking, even on the simple example presented above:

rx7 143 =>ldd a.out 
libthread.so.1 => /usr/lib/libthread.so.1 
libc.so.1 => /usr/lib/libc.so.1 
libdl.so.1 => /usr/lib/libdl.so.1 
/usr/platform/SUNW,Sun-Blade-1000/lib/libc_psr.so.1

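Note: since all of the above are libraries for the Solaris OE, you don't need to search each of them in turn. Alternatively, you can view the libraries of a running process with pldd, and then use nm to list the undefined symbols that the binary expects those libraries to supply: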

rx7 144 =>a.out 100 & 
[2] 6221 
rx7 145 =>pldd 6221 
6221: a.out 200 
/usr/lib/libthread.so.1 
/usr/lib/libc.so.1 
/usr/lib/libdl.so.1 
/usr/platform/sun4u-us3/lib/libc_psr.so.1 

rx7 146 =>nm a.out | grep UNDEF 
[56] | 0| 0|NOTY |WEAK |0 |UNDEF |__1cG__CrunMdo_exit_code6F_v_ 
[50] | 133676| 0|FUNC |GLOB |0 |UNDEF |_exit 
[70] | 133760| 0|FUNC |WEAK |0 |UNDEF |_get_exit_frame_monitor 
[65] | 133652| 0|FUNC |GLOB |0 |UNDEF |atexit 
[53] | 133736| 0|FUNC |GLOB |0 |UNDEF |atoi 
[74] | 133712| 0|FUNC |GLOB |0 |UNDEF |ctime 
[73] | 133664| 0|FUNC |GLOB |0 |UNDEF |exit 
[58] | 133688| 0|FUNC |GLOB |0 |UNDEF |poll 
[49] | 133724| 0|FUNC |GLOB |0 |UNDEF |printf 
[45] | 133748| 0|FUNC |GLOB |0 |UNDEF |thr_create 
[68] | 133700| 0|FUNC |GLOB |0 |UNDEF |time

Of course, the example here is a very simple case; usually many libraries will have to be searched for all the different nonreentrant calls, and libraries may be dynamically loaded when putting the code through its paces.
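Checking each undefined symbol by eye quickly becomes tedious. As a rough sketch of how the comparison could be automated, the helper below tests a symbol name against a table of nonreentrant calls. The function name is_nonreentrant() and the table are assumptions made for illustration; the table holds only a subset of the calls listed later in this article and would need to be extended for real use.

```c
#include <stddef.h>
#include <string.h>

/* Subset of the nonreentrant calls listed in this article; extend as needed. */
static const char *const unsafe_calls[] = {
    "asctime", "ctime", "gethostbyname", "getpwnam", "gmtime",
    "localtime", "readdir", "strtok", "tmpnam"
};

/* Returns 1 if sym exactly matches a name in the table, 0 otherwise. */
int is_nonreentrant(const char *sym)
{
    size_t i;

    for (i = 0; i < sizeof unsafe_calls / sizeof unsafe_calls[0]; i++) {
        if (strcmp(sym, unsafe_calls[i]) == 0)
            return 1;
    }
    return 0;
}
```

Applied to the nm output above, is_nonreentrant("ctime") would flag the one offending symbol, while names such as printf or the reentrant ctime_r would pass.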

An Efficient Solution Using a Tool

Here is the source code for a simple library, multithreaded_nonreentrant.c, that can interpose all the nonreentrant library calls for the Solaris OE that have reentrant equivalents. Simply compile it, set the environment variable LD_PRELOAD to point to it, then run your code:

rx7 368 =>cc -mt -o nonreentrant.so.1 -G -K pic multithreaded_nonreentrant.c
rx7 369 =>setenv LD_PRELOAD ./nonreentrant.so.1 
rx7 370 =>setenv PrintInfo 1 
rx7 371 =>./a.out 10 
INFO : Interposed thr_create looking up real function ptr 
main looping...total count so far 0 
**ERROR** - thread 4 calling MT unsafe ctime(); threads: 14 
**ERROR** - thread 4 calling MT unsafe localtime(); threads: 14 
**ERROR** - thread 4 calling MT unsafe asctime(); threads: 14 
main looping...total count so far 5851 
main looping...total count so far 25225 
main looping...total count so far 44296 
main looping...total count so far 62846 
main looping...total count so far 82171 
main looping...total count so far 100201
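The messages above come from wrapper functions that the preloaded library places in front of the libc routines. The source of multithreaded_nonreentrant.c is not reproduced here, so the following is only a minimal sketch of the general interposition technique, assuming the wrapper locates the real function with dlsym(RTLD_NEXT, ...). The counter unsafe_ctime_calls and the exact message text are illustrative and are not taken from the tool.

```c
#define _GNU_SOURCE   /* exposes RTLD_NEXT on glibc; assumed harmless elsewhere */
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

typedef char *(*ctime_fn)(const time_t *);

/* Number of flagged calls seen so far (illustrative; not itself thread-safe). */
int unsafe_ctime_calls = 0;

/* Same name and signature as the libc function, so the runtime linker
 * binds callers to this definition when the library is preloaded. */
char *ctime(const time_t *clock)
{
    static ctime_fn real_ctime;

    if (real_ctime == NULL) {
        /* RTLD_NEXT: look up the next definition after this object (libc's). */
        real_ctime = (ctime_fn)dlsym(RTLD_NEXT, "ctime");
        if (real_ctime == NULL)
            return NULL;
    }
    unsafe_ctime_calls++;
    fprintf(stderr, "**ERROR** - calling MT unsafe ctime()\n");
    return real_ctime(clock);
}
```

The wrapper reports the call and then forwards it, so the program under test behaves as before while its unsafe calls are logged.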

To pinpoint where the offending calls are made, set the PrintStack environment variable. You'll notice that ctime() is the only routine actually called by a.out (in the test_thread() function), while localtime() and asctime() are called by ctime():

rx7 372 =>setenv PrintStack 1 
rx7 373 =>./a.out 10 
INFO : Interposed thr_create looking up real function ptr 
main looping...total count so far 0 
**ERROR** - thread 4 calling MT unsafe ctime(); threads: 14 
unknown lib:??+0x11f437e 
./nonreentrant.so.1:print_stack+0x38 
./nonreentrant.so.1:ctime+0x108 
a.out:test_thread+0x54 
/usr/lib/libthread.so.1:_getfp+0x124 
a.out:test_thread+0x0 
**ERROR** - thread 4 calling MT unsafe localtime(); threads: 14 
unknown lib:??+0x11f444e 
./nonreentrant.so.1:print_stack+0x38 
./nonreentrant.so.1:localtime+0x108 
/usr/lib/libc.so.1:ctime+0x4 
./nonreentrant.so.1:ctime+0x194 
a.out:test_thread+0x54 
/usr/lib/libthread.so.1:_getfp+0x124 
a.out:test_thread+0x0 
**ERROR** - thread 4 calling MT unsafe asctime(); threads: 14 
unknown lib:??+0x11f43ee 
./nonreentrant.so.1:print_stack+0x38 
./nonreentrant.so.1:asctime+0x108 
./nonreentrant.so.1:ctime+0x194 
a.out:test_thread+0x54 
/usr/lib/libthread.so.1:_getfp+0x124 
a.out:test_thread+0x0 
main looping...total count so far 6660 
main looping...total count so far 27012 
main looping...total count so far 47074 
main looping...total count so far 65870 
main looping...total count so far 84243 
main looping...total count so far 103323

In this case, the bug is fixed by changing two lines in mtunsafe.c:

char *ct; => char *ct, ctimebuf[60];

and

ct = ctime(&t); => ct = ctime_r(&t,ctimebuf,sizeof(ctimebuf));

Not all programs work well with LD_PRELOADed libraries, so you could compile or link the nonreentrant.so code directly into your program, or even convert it to use the much more complicated linker auditing mechanism. See the Solaris 8 Software Developer Collection book titled Linker and Libraries Guide under the topic Runtime Linker Auditing Interface for details of how to use this mechanism (http://docs.sun.com).

Please Note: while this tool is recommended for development and QA purposes, it should not be part of any actual deployment and will not be supported.

The following is a list of nonreentrant standard library calls for the Solaris 8 OE that you should never make in a multithreaded program:

ctime, localtime, gmtime, asctime, strtok, gethostbyname, gethostbyaddr, getservbyname, getservbyport, getprotobyname, getnetbyname, getnetbyaddr, getrpcbyname, getrpcbynumber, getrpcent, rand, ctermid, tmpnam, readdir, getlogin, getpwent, getpwnam, getpwuid, getspent, fgetspent, getspnam, getgrnam, getgrgid, getnetgrent, tempnam, fgetpwent, fgetgrent, ecvt, gcvt, getservent, gethostent, getgrent, fcvt
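As one more illustration of replacing a call from this list, the sketch below uses strtok_r() in place of strtok(). The function count_fields() is a hypothetical helper written for this example. The only difference from the strtok() version is the extra save pointer, which keeps the parsing position in a caller-owned variable rather than in hidden static storage.

```c
#include <string.h>

/* Counts the non-empty comma-separated fields in line, modifying it in place.
 * strtok_r() records its position in save, a variable owned by the caller,
 * so concurrent calls on different strings do not interfere. */
int count_fields(char *line)
{
    char *save = NULL;
    char *tok;
    int n = 0;

    for (tok = strtok_r(line, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        n++;
    return n;
}
```

Like strtok(), strtok_r() treats runs of delimiters as one, so empty fields are not counted.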

Conclusion

Nonreentrant calls in MT code are often overlooked and tend to bite developers late in the development cycle. The tips and tools presented here can catch many such problems before deployment.

Now, get out there and eliminate these nasty lurking bugs!

About the Author

Bruce Chapman is a staff engineer who has been with Sun Microsystems for seven years.
