Sun Java Solaris Communities My SDN Account Join SDN

Article

Atomic SPARC: Using the SPARC Atomic Instructions

 
By Richard Marejka, March 2008  

Article Index

The SPARC architecture is a RISC processor that originally appeared in systems from Sun Microsystems in 1986 as the Sun-4/260. Since then, the processor has undergone many refinements to meet the changing needs of its customers.

Today's SPARC processors are designed for both high performance and energy efficiency. This is achieved by incorporating multiple cores on a processor die and multiple processors within a system. These systems yield excellent results when applied to multiprocess and multithreaded jobs. The degree of parallelism and efficiency could not be achieved without atomic instructions. These instructions provide the basis for the synchronization required to achieve the high degree of parallelism associated with the Solaris Operating System (Solaris OS) and SPARC.

This article briefly introduces the SPARC memory model and atomic instructions, then implements a number of IBM AIX interfaces for use on the Solaris OS. The article assumes that the reader is familiar with assembler language programming. This collection of examples serves to demonstrate the use of the SPARC atomic instructions and memory models, and it provides a small library that can be useful for the programmer who wants to port IBM AIX-based source code to the Solaris OS / SPARC platform.

Memory Model

The SPARC Version 9 (SPARC v9) specification defines three memory models, from least restrictive to most restrictive:

  • Relaxed Memory Order (RMO). There are no ordering constraints on memory references outside of those required for processor self-consistency. Where ordering is required, the developer must explicitly design and code for it by using the membar instruction.
  • Partial Store Order (PSO). This has all of the requirements of RMO plus loads are ordered with respect to earlier loads, and atomic load-stores are ordered with respect to loads. A membar instruction is required to order stores and atomic load-stores with respect to each other.
  • Total Store Order (TSO). This has all the requirements of PSO plus stores are ordered with respect to earlier stores, and atomic load-stores are ordered with respect to loads and stores.

The SPARC architecture provides multiple memory models for two reasons: so that implementations can schedule memory operations for higher performance, and so that programmers can create synchronization primitives using shared memory. The less-restrictive memory models -- RMO and PSO -- afford more opportunities for application performance improvements by the processor.

An idealized SPARC v9 processor has the form shown in Figure 1.

SPARC processor
Figure 1. SPARC v9 Processor
 

The processor's issue unit reads instructions from memory and issues them in program order. Program order is the order determined by the control flow of the application under the assumption that each instruction executes independently and sequentially.

The reorder unit gathers those issued instructions for dispatch to the execute unit. The reorder unit allows an implementation to rearrange instructions to perform some instructions in parallel for greater efficiency.The reordering is constrained to maintain program order.

The execute unit carries out the instruction and writes results to a buffer unit.

The buffer unit schedules write operations to memory. The presence of the buffer unit frees the execute unit from delays incurred by writing to memory. The buffer unit can also respond to load requests for a memory address when it holds the contents of the address from a previous store. This introduces a potential for inconsistency -- a write request to memory can be present in the buffer unit, reloaded by the issue unit from the buffer unit, modified by the execute unit again and the write request enqueued yet again before a write to memory by the buffer unit. Although this can in theory produce an inconsistency between processes in a single-processor system, this inconsistency does not in fact occur, because the actions of a process-context switch include a buffer unit flush to memory.

In a multiprocessor system, an inconsistency can occur as a result of the presence of the buffer unit. The inconsistency will arise when memory shared between processes is modified by their respective processors but unwritten to memory -- that is, when the modified values are present in the buffer unit of more than one processor. Figure 2 illustrates a multiprocessor system.

SPARC multiprocessor system
Figure 2. SPARC v9 Multiprocessor System
 
 
Membar Instruction

The SPARC v9 architecture includes the membar instruction. The term membar is a contraction of memory barrier. The membar has two variants:

  • Ordering gives the programmer control over the order of loads and stores issued by a processor.
  • Sequencing gives the programmer control over the order and completion of memory operations.

The ordering membar instruction imposes an order on the instruction stream of one processor. Loads and stores that appear before the membar instruction are ordered with respect to those appearing after the membar. The atomic instructions are ordered as if they are both a load and a store, because they perform both operations. The instruction contains four bit-encoded ordering relationships:

  • #LoadLoad
  • #StoreLoad
  • #LoadStore
  • #StoreStore

The semantics of each #XY relationship is as follows: "All X operations that appear before the membar in program order complete before any Y operations that follow after the membar in program order." More complex ordering requirements can be created by combining relationships using the bit-wise or operator.

The SPARC Version 8 (SPARC v8) stbar instruction is a subset of membar, equivalent to a membar with an ordering relationship of #StoreStore.

Depending on the memory model under which the system is operating, the programmer must explicitly insert memory-ordering instructions to guarantee program correctness. For example, the SPARC assembler code to release a lock using the store unsigned byte (stub) instruction is as follows:

	unlock_ldstub:
		nop                                ! TSO or 
		membar	#StoreStore                ! PSO or
		membar	#StoreStore | #LoadStore   ! RMO
		stub	%g0,[%o0]
		retl
		nop
 

Remember that a lock protects some variable that will be modified after the lock is acquired and before the lock is released. The modification of the variable implies a store to memory by the program. In PSO mode, the #StoreStore will cause the store to the variable to be ordered with respect to the store to the lock. If the order was not imposed, the lock could appear to be free before the variable had been updated in memory.

Those who follow the defensive school of programming will develop code assuming the least restrictive model -- RMO -- because this will execute correctly when using the other two SPARC memory models. The defensive school would also gather all synchronization primitives into a common system library.

Atomic Instructions

SPARC machine instructions are normally executed to completion without interruption. This includes the memory access instructions of load and store. In multiprocessor-multicore systems, two or more processes executing an instruction using the same memory address are guaranteed to occur in a serial but undefined order. This guarantees memory consistency, but the order of operations is undefined, as are the memory contents after the operations complete.

Atomic instructions act like both a load and a store, extending the "without interruption" requirement to include both operations. These instructions allow the creation of multithreaded and cooperating multiprocess applications that take advantage of the concurrency offered by today's high-performance systems.

SPARC Version 9 (v9) has three atomic instructions:

Load-Store Unsigned Byte: ldstub

The load-store unsigned byte (ldstub) instruction was the original atomic primitive used in the Solaris OS to implement mutual exclusion locks -- mutex -- and other thread synchronization primitives. The instruction writes 0xff into a byte and returns the old value in a register. To acquire a lock, a caller used ldstub and then inspected the value returned. If the returned value was zero, the lock had then been acquired and the lock byte now contained 0xff. If the return value was 0xff, the lock was held by a previous caller. The caller would then execute some adaptive algorithm to wait until the lock became available -- that is, its value became or returned to zero.

Here is the algorithm for the instruction, presented in a C-like pseudocode:

int8_t
ldstub( int8_t *lock_byte ) {
	int8_t   old_value;

	atomic {
	    old_value  = *lock_byte;
	    *lock_byte = 0xff;
	}

	return( old_value );
}
 

The ldstub is the classic test-and-set instruction. The shortcoming of the instruction is that it has a consensus number of two and hence cannot resolve more than two contending processes in a wait-free fashion.

Swap Register With Memory: swap

The swap register with memory (swap) instruction exchanges the contents of a word in memory with a register. According to the SPARC v9 manual, the swap instruction is deprecated, and programmers should use the casxa instruction in its place.

The algorithm is as follows:

int32_t
swap( int32_t *word, int32_t new_value ) {
	int32_t   old_value;

	atomic {
	    old_value = *word;
	    *word     = new_value;
	}

	return( old_value );
}
 

Similar to the ldstub instruction, swap has a consensus number of two.

Compare and Swap: cas

The SPARC v9 manual introduced the newest atomic instruction: compare and swap (cas) at section 8.4.6.1.The instruction uses a memory address and two registers. It compares a word in memory with a register. If these are equal, the instruction swaps the contents of the word in memory with a second register.

The instruction has an infinite consensus number: It can resolve an infinite number of contending processes in a wait-free fashion. You can use cas for so-called lock-free operations -- linked-list management is the classic example. The term lock-free is somewhat of a misnomer. The process still uses locking, but the locking takes place in hardware instead of the more commonly coded mutual-exclusion lock.

Here is the algorithm for cas:

int64_t
cas64( int64_t *word, int64_t test_value, int64_t new_value ) {
	int64_t   old_value;

	atomic {
	    old_value	= *word;

	    if ( *word == test_value )
	        *word= new_value;
	}

	return( old_value );
}
 

To determine whether a swap took place, compare the return value in the second register with the test value used in the first register. Here is what this looks like in this article's pseudocode:

if ( cas64( &cas_word, testV, newV ) == testV )
	/* swap took place */
 

You can use a simple performance enhancement when you write this in SPARC assembler.

Solaris OS Interfaces

The Solaris OS has many interfaces for use in concurrent and multiprocessor systems. These Solaris thread interfaces date from at least Solaris 2.4 (March 1993). The POSIX thread interfaces were added in Solaris 2.5 (June 1995). These are often referred to as the Pthread interfaces and originally appeared in the 3T manual section. In Solaris 10, both Solaris and POSIX thread interfaces are documented in the Multithreaded Programming Guide.

The Pthread interfaces have been extended over the years to remain POSIX compliant and to add functionality that is not part of the POSIX standard. The thread library has been rewritten at least once. This rewrite harmonized the Solaris OS and POSIX interfaces, integrated the interfaces into libc, and improved performance.

Most vendors have adapted the POSIX thread interfaces, so porting applications between vendor platforms is largely a matter of recompiling for any thread interfaces. However, a simple recompile will not solve one class of problems. If the application used synchronization primitives that were vendor specific, then porting the application requires more than a recompile. In some cases, this may be trivial. For example, a simple interface mapping can be made from Solaris OS to POSIX thread interfaces with almost perfect fidelity. But once again, some interfaces will not be trivial to implement.

IBM AIX Interfaces

The IBM AIX platform has offered a few vendor-specific synchronization primitives since version 3.2 was released in 1992. The paper "Turning the AIX Operating System Into an MP-Capable OS" by Jacques Talbot provides a good background on the PowerPC processor and synchronization under the AIX operating system. The paper is available from the USENIX 1995 Technical Conference Proceedings under the Potpourri II section.

The synchronization primitives in question are the following:

	#include   <sys/atomic_op.h>

	int         fetch_and_add( atomic_p word_addr, int value );
	uint        fetch_and_or( atomic_p word_addr, int mask );
	uint        fetch_and_and( atomic_p word_addr, int mask );
	void        _clear_lock( atomic_p word_addr, int value );
	boolean_t   _check_lock( atomic_p word_addr, int old_val, int new_val );
	boolean_t   compare_and_swap( atomic_p word_addr, int *old_val_addr, int new_val );
	boolean_t   test_and_set( atomic_p word_addr, int mask );
 

Knowing the SPARC atomic instructions and a little SPARC assembler, you can implement the seven AIX synchronization primitives for the Solaris OS. This should result in a compatibility library for porting AIX applications to the Solaris platform. The library will be implemented for Solaris / SPARC v9 platforms running in 32-bit mode. The programmer community will have to provide SPARC 64-bit, AMD, and Intel implementations.

Each of the interfaces in this section will be presented in a C-like pseudocode and implemented in SPARC assembler. All of the interfaces are leaf routines and as such can take advantage of the leaf procedure optimizations as described in sections D.5 and H.1.2 of the SPARC Architecture Manual v8 (PDF).

The fetch_and_add, fetch_and_or, and fetch_and_and Interfaces

The fetch_and_OP interfaces all follow the same algorithm, substituting the specific operation where required. The SPARC platform does not operate directly on memory -- all operations are register-based with the exception of load, store, and the atomic instructions. This requires the fetch_and_OP interfaces to employ a cas instruction within a loop. Once the expected result is computed within the loop, the cas is used to attempt the update. If the update fails, the loop continues.

Here is the algorithm:

int
fetch_and_OP( atomic_p word_addr, int value ) {
	int   result;

	atomic {
	    result       = *word_addr;
	    *word_addr OP= value;

	}
	return( result );
}
 

In SPARC assembler, this becomes the following:

	fetch_and_OP:
	      ld        [%o0],%g1         ! load the current value
	loop: OP        %g1,%o1,%o2       ! compute the desired result
	      cas       [%o0],%g1,%o2     ! try to CAS it into place
	      cmp       %g1,%o2
	      bne,a,pn  %icc,loop         ! CAS failed, try again
	      mov       %o2,%g1           ! save current value for next iteration
	      retl
	      mov       %o2,%o0           ! return the old value
 

All that remains to complete the three interfaces is the substitution of and, or, or add for the OP placeholder instruction.

Note that two conditional branch features are used: annul and predictive. The annul ",a" will skip the mov instruction in the delay slot if the branch is not taken. The predictive ",pn" is a hint to the processor that the conditional branch will likely not be taken -- that is, the processor should assume low contention.

Also, a performance enhancement is made possible by the return value of the cas instruction. Rather than employing a load to read the word from memory, the value returned by the cas by way of the second register is used.

The _clear_lock and _check_lock Interfaces

These two interfaces update a lock word in an atomic manner. The _clear_lock interface simply sets the value of the lock to a given value -- essentially a runtime initialization. The _check_lock interface conditionally updates the lock word with a new value. The interface _check_lock is a compare-and-swap type of operation.

The algorithms are as follows:

void
_clear_lock( atomic_p word_addr, int value ) {
	*word_addr	= value;
	return;
}

boolean_t
_check_lock( atomic_p *word_addr, int old_val, int new_val ) {
	int	result;

	atomic {
	    if ( *word_addr == old_val )
	        *word_addr  = new_val;
	        result      = FALSE;
	    } else
	        result      = TRUE;
	}
	return( result );
}
 

Because each of these interfaces is intended to be used for synchronization, a memory barrier will be required. Here is the SPARC assembler for them:

	_clear_lock:
	        membar  #StoreStore|#LoadStore  ! memory barrier (RMO)
	        st      %o1,[%o0]               ! store the word
        	retl
	        nop

	_check_lock:
	        cas     [%o0],%o1,%o2           ! try the CAS
	        cmp     %o1,%o2
	        mov     0,%o0                   ! assume it succeeded - return FALSE/0
	        movne   %icc,1,%o0              ! may have failed - return TRUE/1
	        membar  #LoadLoad|#LoadStore    ! memory barrier (RMO)
	        retl
	        nop
 

Of note in the _check_lock interface is the use of a conditional move instruction, specifically movne. The conditional move uses the integer condition codes, by the use of %icc, to move 1 to the return register only if the cas failed as detected by the cmp instruction.

The compare_and_swap Interface

The compare_and_swap interface directly maps to the SPARC v9 cas instruction with two implementation requirements: The return value is a boolean_t, and the old value is returned by way of the second argument if the swap did not take place.

boolean_t
compare_and_swap( atomic_p word_addr, int *old_val_addr, int new_val ) {
	int         oldV    = *old_val_addr;
	boolean_t   result;

	atomic {
	    if ( *word_addr == oldV ) {
	        *word_addr    = new_val;
	        result        = TRUE;
	    } else {
	        *old_val_addr = word_addr;
	        result        = FALSE;
	    }
	}
	return( result );
}
 

The SPARC assembler for the interface is a straightforward mapping to the cas instruction with the appropriate argument and return value management.

	compare_and_swap:
	      ld    [%o1],%g1       ! set the old value
	      cas   [%o0],%g1,%o2   ! try the CAS
	      cmp   %g1,%o2
	      be,a  true
	      mov   1,%o0           ! return TRUE/1
	      mov   0,%o0           ! return FALSE/0
	      st    %o2,[%o1]       ! store existing value in memory
	true: retl
	      nop
 

The only item of note is the return handling, which is written using a conditional branch with the annul bit. If the swap took place, the branch will be taken and the mov 1,%o0 will be executed in the delay slot. If the branch is not taken, the annul bit will cause the first move instruction to be skipped, and execution will continue with the mov 0,%o0 instruction.

The test_and_set Interface

This interface is a bitwise test-and-set operation. A bitwise or of the mask and contents of the word in memory are the test part of the operation. If no bits in mask are set in the word, then mask is added to the word using a bit-wise or operation -- this is the set part of the operation.

boolean_t
test_and_set( atomic_p word_addr, int mask ) {
	boolean_t   result;

	atomic {
	    if ( *word_addr & mask )
	        result   = FALSE;
	    else {
	        *word_addr  |= mask;
	        result       = TRUE;
	    }
	}
	return( result );
}
 

The implementation will use a loop, but not because the interface is required to succeed before returning -- that is not a requirement. The loop is required because of a race-condition between the if ( ... ) and the *word_addr |= mask steps. To illustrate, here is the pseudocode rewritten without the atomic section:

boolean_t
test_and_set( atomic_p word_addr, int mask ) {
	boolean_t   result;

	loop {
	    int oldV   = *word_addr;

	    if ( oldV & mask ) {        /* test step */
	        result = FALSE;
	        break;
	    } else {
	        int   newV = oldV | mask;

	        if ( cas32( word_addr, oldV, newV ) == oldV ) {
	            result = TRUE;
	            break;
	        } else         /* *word_addr changed between test step and cas */
	            continue;
	    }
	}
	return( result );
}
 

The preceding form assumes the existence of a cas32 function that includes an atomic section. This is a safe assumption because the function can be written as follows:

	cas32:
	    cas    [%o0],%o1,%o2
	    retl
	    mov    %o2,%o0
 

And the function is atomic by definition.

This second form of test_and_set as shown earlier closely maps to SPARC assembler and is as follows:

	test_and_set:
	       ld     [%o0],%g1       ! load the current value
	loop:  andcc  %g1,%o1,%g0     ! test mask against value
	       bnz,a  done
	       mov    0,%o0           ! return FALSE/0
	       or     %g1,%o1,%o2     ! compute the desired result
	       cas    [%o0],%g1,%o2   ! try the CAS
	       cmp    %g1,%o2
	       be,a   done
	       mov    1,%o0           ! return TRUE/1
	       ba     loop            ! try again
	       mov    %o2,%g1         ! save current value for next iteration
	done:  retl
	       nop
 
 
Conclusion

This article has provided an overview of the SPARC v9 processor memory model and atomic instructions as they pertain to multiprocessor systems and shared memory applications. It also provided a Solaris OS implementation of several IBM AIX interfaces that you can use to aid in porting AIX-based applications to the Solaris OS.

Acknowledgments

This article could not have been written without the help of java.sun.com manager Jill Welch and managing editor Christine Dorffi.

For More Information
Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Comments:
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.