|
April 2000
Summary
This article introduces you to STL standard library and thread-safety issues
when the standard library is used in a multi-threaded mode. It explains
how a current implementation, which uses coarser granularity locks, causes
performance penalty. The penalty is assessed using a simple test that creates
string objects in a multi-threaded mode. This article also describes an
alternative locking technique and a possible future enhancement that can
improve performance and compares their relative merits to the current mutex
based critical sections. Finally, it explains how to use the workshop
compilers with other standard libraries that are currently available.
Introduction
The Sun WorkShop 5.0 compiler currently provides a C++ standard library in archive
form (libCstd.a). You can use this library to develop threaded applications.
The standard library is based on Rogue Wave standard library version 2.01.01.
Rogue Wave documentation on thread safety states that:
-
If you use the library in a multi-threading environment (and you make use
of multiple threads) then you will need to compile with _RWSTD_MULTI_THREAD
defined and link to a version of the library built with the option -DTHREAD=MULTI
(rwstdmt.lib). This will give you a multi-thread safe version of the library.
The multi-thread safe version of the library makes sure that the internals
of the library all work properly in a multi-threading environment. You
will still need to lock around any library objects that you share between
threads (except for iostreams and locale objects).
For example, if you instantiate a string and create a new thread and you
want to pass that string to the thread by reference, you will need to lock
around write access to that string because you are explicitly sharing the
one string object between threads. The library provides a mutex class (mutex_lock
and mutex_unlock from the thread library) to accomplish this task.
On the other hand, if you pass the string to the new thread by value,
then you will not need to lock, even though the string in the two different
threads may share a representation through Rogue Wave's "copy on write"
technology. The library handles the locking automatically and provides
thread-safety at a finer granularity level via a separate reference counting
class (for example, string_ref for strings.) In other words, the reference
counting is updated automatically in a thread safe manner. Hence you are
only required to lock when making an object available to multiple threads
explicitly, either by passing references between threads or by using global
or static objects.
Back to Top
Copy on Write and Reference Counting
Classes (for example, string) use a technique called copy on write
to minimize copying. This technique offers the advantage of easy-to-understand
value semantics with the speed of reference counted pointer implementation.
This section illustrates how the technique works. When you initialize
a String with another String via the copy constructor:
String(cont RWCString&);
The two strings share the same data until one of them tries to write to
it. At that point, a copy of the data is made, and the two strings go their
separate ways. Copying only at "write" time makes copies of strings, particularly
read-only copies, very inexpensive.
The following example shows how four objects share one copy of a string
until one of the objects attempts to change the string:
#include <string.h>
String g;
// Global object
void setGlobal(String x) { g = x; }
main(){
String a("kernel"); // 1
String b(a); // 2
String c(a); // 3
setGlobal(a); // Still only one copy of "kernel"! //4
b += "s"; // Now b has its own data: "kernels" //5
}
| //1 |
Object a is the actual allocation
and initialization of the memory to hold the string kernel. |
| //2-//3 |
When objects b and c are created
from a, they merely increment a reference count in the original
data and return. At this point, there are three objects that look
at the same piece of data. |
| //4 |
The function setGlobal() sets the
value for g, the global string to the same value. Now the reference count
is up to four, and there is still only one copy of the string kernel. |
| //5 |
Finally, object b tries to change
the value of the string. It looks at the reference count and sees that
it is greater than one, which implies that the string is being shared by
more than one object. At this point, a clone of the string is made and
modified. The reference count of the original string drops back down to
three, and the reference count of the newly cloned string is one. |
The counter for the reference increments and decrements under the protection
of a lock.
Back to Top
LEVEL 2 Thread_safety
The library is safe when only pass by value is used. So if you pass std::strings
by reference, you will need to lock access and enforce one-thread-at-a-time
access yourself. Take the following example:
std::string str = "A string";
a()
{
...
thr_create(NULL, 0, b, (LPVOID)NULL, 0,&tid);
str = "";
}
b()
{
std::string strcopy = str;
}
If both assignments happen at the same time, the first one can end up in
the reference by going to 0, thus freeing the memory. Meanwhile, the second
assignment can see the reference at 1 and decides to add a reference to
a now deallocated replica. This example demonstrates that even a simple
operation such as "=" should have a lock to prevent a race condition. In
other words, every accessible member function should be guarded and such
can result in severe performance penalty.
To access the performance implication, a level_2 thread safety model
was implemented for string class. The header files and string class member
were modified to accommodate one thread at a time access. Take a look at
a sample implementation:
#if defined (_RWSTD_MULTI_THREAD) && defined
(_RWSTD_STRING_MT_SAFETY_LEVEL_2)
# define MT_GUARD(name) _RWSTDGuard name (this->__mutex)
#else
# define MT_GUARD(name) ((void)0)
#endif // _RWSTD_MULTI_THREAD
When level_2 thread safety is defined, Rogue Wave's STDGuard method gets
invoked, which defaults to a void.
The actual implementation for the operator "=" method looks like the
following:
template <class charT, class traits, class Allocator
> |
basic_string<charT, traits, Allocator> &
basic_string<charT, traits, Allocator>::operator= (
const basic_string<charT, traits, Allocator>& str)
{
if (this != &str)
{
MT_GUARD(guard);
if (str.__pref()->__references() > 0)
{
str.__pref()->__addReference();
__unLink();
__data_ = str.__data_.data();
}
else
this->operator=(str.c_str());
}
return *this;
}
As you can see, the locking is introduced at a much coarser level that
is thus serializing the entire invocation.
Back to Top
Atomic updates for reference counting
Another optimization uses atomic updates for counters instead of mutex.
Level 2 type guarding at coarser grain shows that the critical section
is much larger. It also shows that mutex based guard overheads are small
in comparison to the amount of work done in the critical section. However,
reference counting, and counter increment and decrement operations are
more efficient using atomic updates because the overheads from calling
mutex methods can be much higher.
This section discusses the two different optimizing techniques. One
uses a more general swap based on spin blocking while the other uses compare
and swap based updates. You can use spin blocking as a general replacement
for mutexes. The CAS based Fetch_and_Add is more specific to counter increments
and decrements and requires no guarding.
Swap based spin blocking for guard methods are shown below and these
methods were implemented in regular header file in a transparent manner.
In the following example, SWAPINITLOCK, SWAPLOCK and SWAPUNLOCK macros
are called in the place of pthread_mutex_init, pthread_mutex_lock and pthread_mutex_unlock:
extern "C" Test_and_Set(long *,long);
#define SWAPINITLOCK(lp)
(*(lp) = 0)
#define SWAPLOCK(lp)
while (Test_and_Set(lp, 1L));
#define SWAPUNLOCK(lp)
Test_and_Set(lp,0L)
The Test_and_Set is implemented as a inline function using swap:
.inline Test_and_Set,8
mov %o0,%o2
mov %o1,%o0
swap [%o2],%o0
.end
In CAS based updates, _RWSTD_MT_INCREMENT macro is defined to be a Fetch_and_Add
inline assembler function:
.inline fetch_and_add,4
retry:
ld [%o0],%l0
add %l0,%o1,%l1
cas [%o0],%l0,%l1
cmp %l0,%l1
bne retry
mov %l1,%o0
nop
.end
Back to Top
Performance Measurement
A simple multi-threaded server was designed to test and compare performances.
For each case, a new library (libCstd.a) was built and timing measurements
were done under the same condition on the same system. The ultra 60 repeatedly
created string objects in a mt_mode.
| No. of threads |
No. of objects |
method |
level_2 |
elapsed time |
| 2 |
10,000 |
mutex |
off |
31 sec |
| 2 |
10,000 |
mutex |
on |
41 sec |
| |
| 2 |
10,000 |
swap/block |
off |
26 sec |
| 2 |
10,000 |
swap/block |
on |
35.7 sec |
| |
| 2 |
10,000 |
CAS/non-blkg |
off |
25 sec |
| |
| 8 |
10,000 |
mutex |
off |
128 sec |
| 8 |
10,000 |
mutex |
on |
168 sec |
| |
| 8 |
10,000 |
swap/block |
off |
153 sec |
| 8 |
10,000 |
swap/block |
on |
170 sec |
| |
| 8 |
10,000 |
CAS/non-blkg |
off |
105 sec |
The tests conclude that coarser grain locks (level 2) produce 25-30% performance
penalty and is more efficient when objects are shared by value among threads
than by reference. Furthermore, the atomic swap based updates show performance
benefits only when the number of CPUs on the system is greater or equal
to the number of concurrent servers. When the number of servers exceed
the amount of available CPUs, swap based spin blocking is inefficient and
shows marked performance degradation.
A new technique that completely avoids guard class for counter increments
and decrements has been implemented. It uses SPARC -v9 CAS instruction for
atomic update of the reference counting. This technique by far shows the
best performance of all schemes and the results on 8 CPUs show up to 40-50%
in performance benefit.
This table will be updated as soon as tests are completed.
Conclusion
A study was performed using a multi-threaded server to assess the performance
of standard library using different levels of synchronization. Present
study shows the advantage of providing users with the ability to have a
pluggable STL and standard library so that they can tune their requirements.
Current evaluation of other STL implementations has been successful in
integrating with Workshop compilers.
April 2000
Back to Top
|
|