b. accept
The accept function is used to accept an incoming client connection request.
On a connection request, the Speed accept is called because it interposes the libsocket.so accept.
The Speed library first calls the libsocket.so accept to actually create a TCP connection. Once a successful connection
has been established, the socket descriptor is stored in a Speed data structure, and synchronization variables are initialized.
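The interposition described above depends on looking up the next definition of a symbol at run time. The following is a minimal sketch of that technique, not the Speed source: the names speed_read, speed_track, speed_fds, and speed_data are hypothetical, and a wrapper with its own name is used so that the example can run inside one program. A real interposer would define read itself in a library loaded ahead of libc.

```c
#define _GNU_SOURCE             /* exposes RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef RTLD_NEXT               /* fallback if feature macros were fixed earlier */
#define RTLD_NEXT ((void *)-1l)
#endif

#define MAX_FDS 64              /* hypothetical table size */

typedef ssize_t (*read_fn)(int, void *, size_t);

static int speed_fds[MAX_FDS];          /* 1 if the fd uses the fast path */
static const char *speed_data = "fast"; /* stand-in for data in a Speed buffer */

/* Mark a descriptor as belonging to the fast path (hypothetical helper). */
void speed_track(int fd)
{
    if (fd >= 0 && fd < MAX_FDS)
        speed_fds[fd] = 1;
}

/* Resolve the next definition of read() once and cache the pointer. */
static read_fn next_read(void)
{
    static read_fn fptr;
    if (fptr == NULL)
        fptr = (read_fn)dlsym(RTLD_NEXT, "read");
    return fptr;
}

/* Serve tracked descriptors locally; pass everything else to the real read. */
ssize_t speed_read(int fd, void *buf, size_t nbyte)
{
    if (fd >= 0 && fd < MAX_FDS && speed_fds[fd]) {
        size_t n = strlen(speed_data);
        if (n > nbyte)
            n = nbyte;
        memcpy(buf, speed_data, n);
        return (ssize_t)n;
    }
    read_fn real = next_read();
    if (real == NULL)
        return -1;
    return real(fd, buf, nbyte);
}
```

Calling speed_read on an ordinary pipe falls through to the real read, while a descriptor registered with speed_track is served from the local buffer.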
c. read
The read on the server side is a consumer of the client-write data. When
the server tries to read data on a file descriptor, the Speed library read
function is called because it is interposed. A check is made to see whether the
file descriptor matches the established file descriptor and, if so, the read
function waits on the semaphore occupied_r. For all other file descriptors, the
Speed library function transfers control to the libc.so read.
When the client writes some data, the data is transferred into the server
process using doors IPC. The door service in the server process
identifies whether the operation is a read or a write. For a write, it executes
a sema_wait on empty_r, copies the data into the Speed buffer, and executes
a sema_post on the occupied_r semaphore. The sema_post wakes up the read thread. The read thread copies the data using bcopy and signals the door service thread by executing sema_post on the empty_r semaphore.
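void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
    ...
    } else if (ptr->type == WRITE) {
        fd = ports[ptr->port];                      /* look up the fd for this client port */
        SEMA_WAIT(&pmap[fd].empty_r);               /* wait until the read buffer is free */
        bcopy(ptr->buf, pmap[fd].rbuf, ptr->size);  /* copy client data into the Speed buffer */
        sema_post(&pmap[fd].occupied_r);            /* wake up the server read thread */
    }
    ...
}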
ssize_t read(int fildes, void *buf, size_t nbyte)
{
    ...
    if (fildes > 0 && pmap[fildes].fd == fildes) {
        SEMA_WAIT(&pmap[fildes].occupied_r);    /* block until the client has written */
        bcopy(pmap[fildes].rbuf, buf, nbyte);
        sema_post(&pmap[fildes].empty_r);       /* release the buffer to door_service */
        return nbyte;
    }
    ...
}
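The handoff between door_service and read is a one-slot producer/consumer exchange guarded by two semaphores. The sketch below illustrates the pattern under stated assumptions: it uses POSIX sem_t and memcpy in place of the Solaris sema_t and bcopy, and the names slot_t, slot_put, slot_get, and demo_producer are hypothetical. Unlike the listing above, it also records the length of the stored message so that the consumer copies only what was produced.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 128

typedef struct {
    sem_t  empty;               /* counts free buffers, starts at 1 */
    sem_t  occupied;            /* counts filled buffers, starts at 0 */
    char   buf[SLOT_SIZE];
    size_t len;                 /* bytes currently stored in buf */
} slot_t;

int slot_init(slot_t *s)
{
    s->len = 0;
    if (sem_init(&s->empty, 0, 1) != 0)
        return -1;
    return sem_init(&s->occupied, 0, 0);
}

/* Producer: the role door_service plays for a client write. */
void slot_put(slot_t *s, const void *data, size_t n)
{
    if (n > SLOT_SIZE)
        n = SLOT_SIZE;
    while (sem_wait(&s->empty) != 0)
        ;                       /* retry if interrupted */
    memcpy(s->buf, data, n);
    s->len = n;
    sem_post(&s->occupied);     /* wake up the consumer */
}

/* Consumer: the role the server read plays. */
size_t slot_get(slot_t *s, void *out, size_t cap)
{
    while (sem_wait(&s->occupied) != 0)
        ;
    size_t n = s->len < cap ? s->len : cap;
    memcpy(out, s->buf, n);
    sem_post(&s->empty);        /* hand the buffer back to the producer */
    return n;
}

/* Thread entry for demonstration: sends three fixed messages. */
void *demo_producer(void *arg)
{
    slot_t *s = arg;
    slot_put(s, "one", 3);
    slot_put(s, "two", 3);
    slot_put(s, "three", 5);
    return NULL;
}
```

Because empty starts at 1 and occupied at 0, the producer can fill the slot once and then blocks until the consumer has drained it, which keeps messages in order without any additional lock.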
d. write
The write function on the server side is a producer of
the client-read data. When the server tries to write data
on a file descriptor, the Speed function write is called
since it is interposed. A check is made to see whether the
file descriptor matches the established connection file descriptor,
and, if so, the write function waits on the semaphore empty_w.
If successful, the data is copied to the Speed buffer and
sema_post is executed on occupied_w.
When the client tries to read data, a fast context switch is made into the server
process using doors IPC. The door service in the server process identifies whether the
operation is a read or a write and, for a read, executes a sema_wait
on the occupied_w semaphore. The sema_post on
occupied_w by the write thread wakes up the door_service
thread, and the data in the Speed buffer is transferred to the client-read buffer.
A sema_post is then executed on the empty_w semaphore.
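ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    ...
    if (fildes > 0 && pmap[fildes].fd == fildes) {
        SEMA_WAIT(&pmap[fildes].empty_w);       /* wait until the write buffer is free */
        bcopy(buf, pmap[fildes].wbuf, nbyte);   /* copy server data into the Speed buffer */
        sema_post(&pmap[fildes].occupied_w);    /* signal door_service that data is ready */
        return nbyte;
    }
    ...
}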
void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
    ...
    if (ptr->type == READ) {
        fd = ports[ptr->port];
        SEMA_WAIT(&pmap[fd].occupied_w);            /* wait for server-write data */
        bcopy(pmap[fd].wbuf, ptr->buf, ptr->size);  /* copy it into the door buffer */
        sema_post(&pmap[fd].empty_w);               /* release the write buffer */
        door_return((char *)ptr->buf, ptr->size, NULL, 0);
    }
    ...
}
Client Side Functions
a. connect
The client establishes a connection to the server using the connect
function. Since the connect symbol is interposed, the Speed version of
connect gets control. A connection is established to the server using
the doors IPC. The libsocket.so connect is called to establish a real
connection. If the connection is successful, the necessary Speed data
structures are created.
int connect(int s, const struct sockaddr *addr, socklen_t addrlen)
{
    ...
    if (fptr == 0) {
        /* First call: open the server door and resolve the real connect. */
        if ((door_fd = open(NAME_SERVICE_DOOR, O_RDONLY)) < 0) {
            perror("Open bogus");
            exit(1);
        }
        info.di_target = 0;
        if (door_info(door_fd, &info) < 0) {
            perror("Door_info");
            printf("errno=%d\n", errno);
            exit(1);
        }
        fptr = (int (*)())dlsym(RTLD_NEXT, "connect");
        if (fptr == NULL) {
            (void) printf("dlopen: %s\n", dlerror());
            return (-1);    /* the real connect could not be resolved */
        }
    }
    dinfo[s].fd = s;
    ret = ((*fptr)(s, addr, addrlen));      /* establish the real connection */
    if (ret != -1) {
        /* Record the local port of the new connection. */
        slen = sizeof(client);
        getsockname(s, (struct sockaddr *)&client, &slen);
        dinfo[s].port = client.sin_port;
    }
    return ret;
}
b. read
When the client calls read to get data from the server,
the Speed version of read is called. A check is made to
ensure that the file descriptor matches the established connection,
and a fast context switch is made into the server door_service.
On return, the server data is copied into the client buffer using bcopy.
ssize_t read(int fildes, void *buf, size_t nbyte)
{
    ...
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        /* Describe the read request in the door argument buffer. */
        dinfo[fildes].read.fd = dinfo[fildes].fd;
        dinfo[fildes].read.port = dinfo[fildes].port;
        dinfo[fildes].read.buf[0] = '\0';
        dinfo[fildes].read.size = nbyte;
        dinfo[fildes].read.type = READ;
        dinfo[fildes].darg_r.data_ptr = (char *)&dinfo[fildes].read;
        dinfo[fildes].darg_r.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_r.desc_ptr = NULL;
        dinfo[fildes].darg_r.desc_num = 0;
        dinfo[fildes].darg_r.rbuf = (char *)dinfo[fildes].read.buf;
        dinfo[fildes].darg_r.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_r);      /* switch into the server door_service */
        bcopy(dinfo[fildes].read.buf, buf, nbyte);      /* copy the returned data to the caller */
        return nbyte;
    }
    ...
}
c. write
When the client calls write to send data to the server, the Speed
write is called. A check is made to ensure that the file descriptor
matches the established connection. If so, the write data is bcopied
to a Speed buffer and a fast context switch is made into the server
door_service to wake up the waiting server read thread.
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    ...
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        /* Copy the data into the door argument buffer and describe the request. */
        bcopy(buf, dinfo[fildes].write.buf, nbyte);
        dinfo[fildes].write.fd = dinfo[fildes].fd;
        dinfo[fildes].write.port = dinfo[fildes].port;
        dinfo[fildes].write.size = nbyte;
        dinfo[fildes].write.type = WRITE;
        dinfo[fildes].darg_w.data_ptr = (char *)&dinfo[fildes].write;
        dinfo[fildes].darg_w.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_w.desc_ptr = NULL;
        dinfo[fildes].darg_w.desc_num = 0;
        dinfo[fildes].darg_w.rbuf = (char *)dinfo[fildes].write.buf;
        dinfo[fildes].darg_w.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_w);      /* switch into the server door_service */
        return nbyte;
    }
    ...
}
The implementation of the Speed library with doors and memory maps is more
complex, as data is copied into a memory mapped (mmap(2)) buffer
to avoid making multiple copies of the data. For this implementation, a
sliding window type of buffer management has been adopted. For every connection, the
server creates a shared memory mapped segment. This segment is
divided into multiple windows. Each window is further divided
into slots and the number of slots and the slot sizes are configurable.
The libsocket.so accept is no longer called for loopback connections,
but it is simulated. However, libsocket.so accept is called for
connections coming across the network. The connections are automatically
pooled. This was done to re-use the memory map segments instead of creating
them for every connection. The server caches the connection and, if a client
re-connects, a connection is returned from the pool.
Data is now directly bcopied into an available slot in the
memory mapped segment. The doors IPC is used only to make a fast
context switch into the server process. This makes doors extremely
lightweight, resulting in very fast context switch times. The data
consumption is still based on the producer/consumer model. The producer
now has more slots to copy the data, as the memory mapped segment is divided
into windows and slots.
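The layout of such a segment can be pictured as two rings of fixed-size windows, one for each direction of a full-duplex connection. The sketch below shows only the address arithmetic, under assumptions that are not taken from the Speed source: the names seg_params, seg_data_size, and seg_window are hypothetical, the management area that precedes the data is left out, and the base address is taken to come from an earlier mmap call.

```c
#include <stddef.h>

typedef struct {
    size_t nowins;      /* number of windows per direction */
    size_t winsz;       /* bytes per window */
} seg_params;

/* Data bytes needed for both directions of a full-duplex connection. */
size_t seg_data_size(const seg_params *p)
{
    return 2 * p->nowins * p->winsz;
}

/*
 * Address of window `win` in ring `dir`
 * (0 = client to server, 1 = server to client).
 * The window number wraps, so callers can keep incrementing it.
 */
char *seg_window(void *base, const seg_params *p, int dir, size_t win)
{
    size_t index = (size_t)dir * p->nowins + (win % p->nowins);
    return (char *)base + index * p->winsz;
}
```

With four windows of 16 bytes, for example, the segment needs 128 data bytes, window 5 of the first ring wraps to offset 16, and the second ring starts at offset 64.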
Server Side Functions
a. bind
The bind function operates as in implementation I, creating a new
door service. It initializes buffer management variables and calls the
libsocket.so bind to bind the name.
int bind(int s, const struct sockaddr *addr, socklen_t addrlen)
{
    ...
    if (fptr == 0) {
        /* Create the door and attach it to a file name derived from the port. */
        cptr = (struct sockaddr_in *)addr;
        if ((did = door_create(server, DOOR_COOKIE, DOOR_UNREF)) < 0) {
            perror("door_create");
            return -1;
        }
        sprintf(bptr, "%s%d", NAME_SERVICE_DOOR, cptr->sin_port);
        unlink(bptr);
        mask = umask(0);
        dfd = open(bptr, O_RDONLY|O_CREAT|O_EXCL|O_TRUNC, 0644);
        umask(mask);
        if (fattach(did, bptr) < 0) {
            perror("fattach");
            return -1;
        }
        /* Initialize the connection bookkeeping. */
        accept_block = FALSE;
        if (getenv("SPEED_ACCEPT_BLOCK") != 0)
            accept_block = TRUE;
        mutex_init(&connect_m, USYNC_THREAD, NULL);
        mutex_init(&used_doors.access, USYNC_THREAD, NULL);
        used_doors.front = MAX_FDS;
        used_doors.number = 0;
        mutex_init(&open_doors.access, USYNC_THREAD, NULL);
        open_doors.index = 0;
        open_doors.open = 0;
        /* BUFSIZE = 8192, 8192 / 2 for r, and w, /winsz for number
           of wins */
        bptr = (char *)getenv("SPEED_NOWINS");
        if (bptr == NULL)
            tparams.nowins = NOWINS;
        else
            tparams.nowins = atoi(bptr);
        if (tparams.nowins <= 0)
            tparams.nowins = NOWINS;
        if ((bptr = (char *)getenv("SPEED_WINSIZE")) == (char *)NULL)
            tparams.winsz = BUFSIZE/4;
        else {
            tparams.winsz = atoi(bptr);
        }
        if (tparams.winsz <= 0)
            tparams.winsz = BUFSIZE/4;
        tparams.bufsize = tparams.winsz * tparams.nowins * FULL_DUPLEX;
        tparams.duplex = FULL_DUPLEX;
        /* Size the management area in whole pages, then the whole mapping. */
        pagesize = getpagesize();
        if (pagesize < BUFSIZE)
            pagesize = BUFSIZE;
        tparams.pagesize = pagesize;
        if (tparams.pagesize < (WINDOW_ATTR_SZ * 3 * tparams.nowins))
            tparams.pagesize = (WINDOW_ATTR_SZ * 3 * tparams.nowins);
        tparams.pagesize += WINDOW_MGMT_SZ;
        tparams.pagesize += (pagesize - (tparams.pagesize % pagesize));
        tparams.mmap_sz = (tparams.winsz * tparams.nowins *
            (tparams.duplex+1)) + tparams.pagesize;
        fptr = (int (*)())dlsym(RTLD_NEXT, "bind");
        if (fptr == NULL) {
            DEBUG(fprintf(stderr, "dlopen: %s\n", dlerror()));
            return (-1);    /* the real bind could not be resolved */
        }
        /* accept_p_s lets one connection be produced; accept_r_s signals accept. */
        sema_init(&accept_p_s, 1, USYNC_THREAD, 0);
        sema_init(&accept_r_s, 0, USYNC_THREAD, 0);
        closed_door_q.max_elems = MAX_FDS;
        closed_door_q.first_elem = 0;
        closed_door_q.last_elem = 0;
        closed_door_q.no_elems = 0;
    }
    return ((*fptr)(s, addr, addrlen));
}
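The window configuration in bind follows a common pattern: read an environment variable, fall back to a default when it is unset or not positive, and round the resulting size up to a page boundary. A minimal sketch of that pattern follows. The helper names env_positive and round_up are hypothetical, the NOWINS default of 8 is an assumption (the source does not give its value), and BUFSIZE is 8192 as stated in the comment in the listing. Note that round_up leaves an already aligned size unchanged, whereas the expression in the listing adds a full page in that case.

```c
#include <stddef.h>
#include <stdlib.h>

#define NOWINS  8       /* assumed default window count, not from the source */
#define BUFSIZE 8192    /* value given in the comment in the bind listing */

/* Return the positive integer value of environment variable `name`,
 * or `fallback` when it is unset, malformed, or not positive. */
long env_positive(const char *name, long fallback)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return fallback;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v <= 0)
        return fallback;
    return v;
}

/* Round n up to a multiple of page; an aligned n is returned unchanged. */
size_t round_up(size_t n, size_t page)
{
    return (n + page - 1) / page * page;
}
```

A server started with SPEED_NOWINS=152 would obtain 152 from env_positive("SPEED_NOWINS", NOWINS), while an invalid SPEED_WINSIZE would fall back to BUFSIZE / 4.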
b. accept
The accept function is now simulated for loopback connections.
A producer/consumer paradigm is again employed. The door_service
function is the producer of the connections, and the accept function
is the consumer of these connections. The door_service function
produces connections on requests from loopback clients.
The accept function now waits on the semaphore accept_r_s for
a client connection. When a client tries to establish a loopback
connection, a fast context switch is made using doors IPC into the
door_service on the server. The memory mapped structures are
created if it is a new connection, and a sema_post is executed on
accept_r_s by the door_service thread. This wakes up the accept
thread, and a successful connection is created. The TCP ephemeral
port is also simulated.
void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
    ...
    } else if (ptr->type == CONNECT) {
        client_doorinfo *ptr = (client_doorinfo *)argp;
        size = ptr->size;
        mutex_lock(&connect_m);     /* At the moment connect requests */
                                    /* are serialized, slowing down this segment */
        while (sema_wait(&accept_p_s));     /* wait until accept has taken the previous connection */
        connect_port = -1;
        if (ptr->port > 0)
            connect_port = ptr->port;
        accept_fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect_port == -1) {
            /* Simulate a TCP ephemeral port. */
            connect_port = port_avail;
            port_avail++;
            port_avail %= szshort;
        }
        accept_fd = door_accept(accept_fd, &client,
            sizeof(client), 1);
        if (accept_fd == -1) {
            ptr->port = -1;
            sema_post(&accept_p_s);
            mutex_unlock(&connect_m);
            door_return((char *)ptr, size, NULL, 0);
        }
        ptr->port = client.sin_port;
        pmap[fd].state = INUSE;
        doconnect(accept_fd, (client_doorinfo *)ptr);
        accept_count++;
        sema_post(&accept_r_s);     /* wake up the accept thread */
        mutex_unlock(&connect_m);
        door_return((char *)ptr, size, NULL, 0);
    }
    ...
}
int accept(int s, struct sockaddr *addr, Psocklen_t addrlen)
{
    ...
    for (;;) {
        if (sema_wait(&accept_r_s)) {
            for (j = 0; j < 100; j++);      /* brief spin before retrying */
        } else
            break;
    }
    accept_count--;
    client = (struct sockaddr_in *)addr;
    client->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    client->sin_family = AF_INET;
    client->sin_port = htons(connect_port);
    fildes = accept_fd;
    sema_post(&accept_p_s);     /* let door_service produce the next connection */
    return fildes;
    ...
}
c. read
The server read functions as in Code Example 2 and is a
consumer of the client-write data. Since a sliding window
type of protocol is used, some calculation is required to
find the correct window and the correct slot in the window.
The read function waits on the rd_occupied semaphore. When
the client writes data, it is copied into a memory mapped slot,
and a fast context switch is made into the door_service on the
server. The door service does a sema_wait on the rd_empty
semaphore and, if successful, executes a sema_post operation
on the rd_occupied semaphore. The sema_post wakes up the read thread, and the read thread copies the data using bcopy
and executes a sema_post on the rd_empty semaphore.
ssize_t read(int fd, void *buf, size_t nbyte)
{
    ...
    if (fd > 0 && pmap[fd].fd == fd) {
        w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
        if (pmap[fd].partial_read_flag == 0) {
            /* No partially consumed window: wait for new data. */
            rd_occupied--;
            while (sema_wait(&pmap[fd].rd_occupied));
        }
        /* Locate the active window and the first unread byte in it. */
        win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
        w_attr_ptr = (int *)(pmap[fd].r_w_attr_ptr_offset +
            WINDOW_INDEX(win));
        mptr = pmap[fd].r_mptr;
        w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
        w_dptr = w_dptr + w_attr_ptr[START_ADDR];
        if (nbyte <= w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, nbyte);
            w_attr_ptr[START_ADDR] += nbyte;    /* advance past the bytes just read */
            w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
        } else if (nbyte > w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
            nbyte = w_attr_ptr[CSZ];
            w_attr_ptr[CSZ] = 0;
        }
        if (w_attr_ptr[CSZ] == 0) {
            /* Window drained: move to the next window and release this one. */
            w_attr_ptr[START_ADDR] = 0;
            w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
            w_mgmt_ptr[SERVER_ACTIVE_WIN] =
                w_mgmt_ptr[SERVER_ACTIVE_WIN]
                % tparams.nowins;
            rd_empty++;
            pmap[fd].partial_read_flag = 0;
            sema_post(&pmap[fd].rd_empty);
        } else {
            pmap[fd].partial_read_flag = 1;
        }
        ...
    }
void door_service(void *cookie, char *argp,
    size_t arg_size, door_desc_t *dp, uint_t n_descriptors)
{
    ...
    } else if (ptr->type == WRITE) {
        ...
        while (sema_wait(&pmap[fd].rd_empty));      /* wait for a free window */
        mptr = (int *)pmap[fd].mdoor.mptr;
        w_mgmt_ptr = mptr + WINDOW_MGMT_BEGIN;
        w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
        w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] % tparams.nowins;
        sema_post(&pmap[fd].rd_occupied);           /* wake up the server read thread */
        ...
    }
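The partial-read bookkeeping in the server read can be isolated into a small helper. The sketch below is an illustration under stated assumptions rather than the Speed code: win_attr and window_read are hypothetical names, the fields start and csz stand in for the START_ADDR and CSZ attributes, and the start offset is advanced cumulatively so that several short reads can consume one window.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;   /* offset of the first unread byte (START_ADDR) */
    size_t csz;     /* unread bytes remaining in the window (CSZ) */
} win_attr;

/*
 * Copy up to nbyte unread bytes out of a window.
 * Returns the number of bytes copied. *drained is set to 1 when the
 * window is empty, which is the point at which the reader would move
 * to the next window and post the "empty" semaphore.
 */
size_t window_read(win_attr *w, const char *data, void *buf,
                   size_t nbyte, int *drained)
{
    size_t n = nbyte <= w->csz ? nbyte : w->csz;

    memcpy(buf, data + w->start, n);
    w->start += n;
    w->csz -= n;
    *drained = (w->csz == 0);
    if (*drained)
        w->start = 0;   /* reset for the next producer */
    return n;
}
```

For a window holding ten bytes, reads of 4, 3, and 16 bytes return 4, 3, and 3 bytes, and only the last call reports the window as drained.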
d. write
The write also functions as in implementation I and
is a producer of client-read data. The write function
waits on a wr_empty semaphore and, if successful, bcopies
data into a memory mapped slot. It executes a sema_post
on the wr_occupied semaphore to wake up the door_service
thread. When the client tries to read some data, a fast
context switch is made into the door_service on the server,
and a sema_wait is executed on the wr_occupied semaphore.
If successful, a sema_post is executed on the wr_empty
semaphore.
ssize_t write(int fd, const void *buf, size_t nbyte)
{
    ...
    if (fd > 0 && pmap[fd].fd == fd) {
        cbuf = (void *)buf;
        csz = nbyte;
        w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;
        mptr = pmap[fd].w_mptr;
        while (csz > 0) {
            /* Wait for a free window, then locate it. */
            wr_empty--;
            sema_ptr = (sema_t *)&pmap[fd].wr_empty;
            while (sema_wait(&pmap[fd].wr_empty));
            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int *)(pmap[fd].w_w_attr_ptr_offset +
                WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
            if (csz <= w_attr_ptr[SZ]) {
                /* The remaining data fits in this window. */
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char *)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                /* Fill this window and continue with the next one. */
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char *)cbuf) + w_attr_ptr[SZ];
            }
            w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
                w_mgmt_ptr[CLIENT_ACTIVE_WIN] %
                tparams.nowins;
            wr_occupied++;
            sema_ptr = (sema_t *)&pmap[fd].wr_occupied;
            sema_post(&pmap[fd].wr_occupied);
        }
        ...
    }
void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
    if (ptr->type == READ) {
        while (sema_wait(&pmap[fd].wr_occupied));   /* wait for server-write data */
        ...
        sema_post(&pmap[fd].wr_empty);              /* release the window */
        door_return((char *)&ptr->ret, sizeof(int), NULL, 0);
    }
    ...
}
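The way write splits a large request across successive windows can be sketched without the semaphores. In the example below, ring_t and ring_write are hypothetical names, and the window count and size are small fixed constants chosen for illustration. Because the waits on the empty and occupied semaphores are omitted, the sketch is only valid for requests that fit in the ring.

```c
#include <stddef.h>
#include <string.h>

#define NOWINS 4    /* windows in the ring (illustrative value) */
#define WINSZ  4    /* bytes per window (illustrative value) */

typedef struct {
    char   data[NOWINS][WINSZ];
    size_t csz[NOWINS];     /* bytes stored in each window */
    size_t active;          /* next window the writer fills */
} ring_t;

/* Split buf into window-sized chunks; returns the number of windows filled. */
size_t ring_write(ring_t *r, const void *buf, size_t nbyte)
{
    const char *src = buf;
    size_t used = 0;

    while (nbyte > 0) {
        size_t n = nbyte <= WINSZ ? nbyte : WINSZ;

        /* A full implementation would wait on an "empty" semaphore here. */
        memcpy(r->data[r->active], src, n);
        r->csz[r->active] = n;
        r->active = (r->active + 1) % NOWINS;
        /* ...and post the "occupied" semaphore here. */

        src += n;
        nbyte -= n;
        used++;
    }
    return used;
}
```

Writing ten bytes into this ring fills three windows with 4, 4, and 2 bytes and leaves the active index at 3.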
Client Side Functions
a. connect
The connect function does a fast context switch to
the door_service to set up a connection with the server.
See the example of the server side accept function in
Code Example 8. On the return from the door service, the
shared memory mapped segment is mapped into the client address
space. The connect function caches client connections, and if
the client reconnects, it sends the cached descriptor to
the server to reestablish the connection.
b. read
The read function is similar to the server read function,
except it is on the client-side. The read function does a fast
context switch to the door_service on the server and waits for
server-write data. On return from the door call, the data is
bcopied to the client buffer from the memory mapped slot.
ssize_t read(int fildes, void *buf, size_t nbyte)
{
    ...
    if (fildes > 0 && dinfo[fildes].fd == fildes) {
        if (dinfo[fildes].partial_read_flag == 0) {
            ...
            /* Ask the server for data through the door. */
            dinfo[fildes].rinfo.size = nbyte;
            dinfo[fildes].rinfo.type = READ;
            dinfo[fildes].rinfo.port = dinfo[fildes].port;
            darg.data_ptr = (char *)&dinfo[fildes].rinfo;
            darg.data_size = sizeof(readinfo);
            darg.desc_ptr = NULL;
            darg.desc_num = 0;
            darg.rbuf = (char *)&dinfo[fildes].rinfo.ret;
            darg.rsize = sizeof(int);
            /* semaphore block on occupied will happen
               in the door server */
            door_call(door_fd, &darg);
            if (dinfo[fildes].rinfo.ret == -1) {
                dinfo[fildes].state = CLOSE;
                return 0;
            }
            if (dinfo[fildes].rinfo.ret > 0) {
                dinfo[fildes].rinfo.nowins =
                    dinfo[fildes].rinfo.ret;
                dinfo[fildes].rinfo.nowins--;
                dinfo[fildes].state = IN_CLOSE;
            }
        }
        /* Locate the active window and copy out of the memory mapped slot. */
        mptr = dinfo[fildes].r_mptr;
        win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
        w_attr_ptr = (int *)(dinfo[fildes].r_w_attr_ptr_offset +
            WINDOW_INDEX(win));
        w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
        w_dptr = w_dptr + w_attr_ptr[START_ADDR];
        if (nbyte <= w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, nbyte);
            w_attr_ptr[START_ADDR] += nbyte;    /* advance past the bytes just read */
            w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
        } else if (nbyte > w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
            nbyte = w_attr_ptr[CSZ];
            w_attr_ptr[CSZ] = 0;
        }
        if (w_attr_ptr[CSZ] == 0) {
            w_attr_ptr[START_ADDR] = 0;
            w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
            w_mgmt_ptr[SERVER_ACTIVE_WIN] =
                w_mgmt_ptr[SERVER_ACTIVE_WIN]
                % tparams.nowins;
            dinfo[fildes].partial_read_flag = 0;
        } else {
            dinfo[fildes].partial_read_flag = 1;
        }
        return nbyte;
    }
}
c. write
The write function is similar to the server-side write function. Client data is bcopied to a memory mapped slot, a fast context switch is made into the door_service function on the server, and the waiting server read thread is woken up.
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    if (fildes > 0 && dinfo[fildes].fd == fildes) {
        cbuf = (void *)buf;
        csz = nbyte;
        w_mgmt_ptr = dinfo[fildes].w_w_mgmt_ptr;
        mptr = dinfo[fildes].w_mptr;
        while (csz > 0) {
            /* Copy the next chunk into the active window. */
            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int *)(dinfo[fildes].w_w_attr_ptr_offset +
                WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
            if (csz <= w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char *)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char *)cbuf) + w_attr_ptr[SZ];
            }
            /* Switch into the server door_service to wake up the read thread. */
            dinfo[fildes].winfo.size = nbyte;
            dinfo[fildes].winfo.type = WRITE;
            dinfo[fildes].winfo.port = dinfo[fildes].port;
            darg.data_ptr = (char *)&dinfo[fildes].winfo;
            darg.data_size = sizeof(writeinfo);
            darg.desc_ptr = NULL;
            darg.desc_num = 0;
            darg.rbuf = NULL;
            darg.rsize = 0;
            door_call(door_fd, &darg);
        }
        return nbyte;
    }
    ...
}
As mentioned earlier, the memory mapped segment is divided into windows, and each window is divided into slots. The number of windows is not configurable at this time, but the number and size of slots are configurable through environment variables. Currently, the number of slots is limited to 152. This could be increased for better performance.
This implementation is similar to implementation II as discussed
above, but doors IPC is not used for context switching. Instead,
system-scope semaphores are created in the shared mmapped space,
and a sema_post is executed on these semaphores to signal data
availability. The bind, accept and connect functions have similar functionality.
Server Side Functions
a. read
The read function is similar to the read function
in Code Example 9. The server read thread
waits on a system-scope w_mgmt_ptr[SEMA_R_O] semaphore.
The client writes data directly into the mmapped slot,
and executes a sema_post on w_mgmt_ptr[SEMA_R_O] semaphore
to wake up the server read thread. The read thread executes
a bcopy to transfer the data from the memory mapped slot
into the server read buffer.
ssize_t read(int fd, void *buf, size_t nbyte)
{
    ...
    if (fd > 0 && pmap[fd].fd == fd) {
        w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
        if (pmap[fd].partial_read_flag == 0) {
            /* Wait on the system-scope "occupied" semaphore in the shared segment. */
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_R_O];
            while (sema_wait(sema_ptr));
        }
        /* Locate the active window and the first unread byte in it. */
        win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
        w_attr_ptr = (int *)(pmap[fd].r_w_attr_ptr_offset +
            WINDOW_INDEX(win));
        mptr = pmap[fd].r_mptr;
        w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
        w_dptr = w_dptr + w_attr_ptr[START_ADDR];
        if (nbyte <= w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, nbyte);
            w_attr_ptr[START_ADDR] += nbyte;    /* advance past the bytes just read */
            w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
        } else if (nbyte > w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
            nbyte = w_attr_ptr[CSZ];
            w_attr_ptr[CSZ] = 0;
        }
        if (w_attr_ptr[CSZ] == 0) {
            /* Window drained: advance and post the "empty" semaphore. */
            w_attr_ptr[START_ADDR] = 0;
            w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
            w_mgmt_ptr[SERVER_ACTIVE_WIN] =
                w_mgmt_ptr[SERVER_ACTIVE_WIN]
                % tparams.nowins;
            rd_empty++;
            pmap[fd].partial_read_flag = 0;
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_R_E];
            sema_post(sema_ptr);
        } else {
            pmap[fd].partial_read_flag = 1;
        }
        return nbyte;
    }
    return ((*fptr)(fd, buf, nbyte));
}
b. write
The write function is similar to the write function
described in implementation II. The write thread waits
on a system-scope w_mgmt_ptr[SEMA_W_E] semaphore. If
successful, data is bcopied into the memory mapped slot,
and a sema_post is executed on the w_mgmt_ptr[SEMA_W_O]
semaphore to wake up the client read thread.
ssize_t write(int fd, const void *buf, size_t nbyte)
{
    ...
    if (fd > 0 && pmap[fd].fd == fd) {
        cbuf = (void *)buf;
        csz = nbyte;
        w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;
        mptr = pmap[fd].w_mptr;
        while (csz > 0) {
            /* Wait on the system-scope "empty" semaphore for a free window. */
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_W_E];
            while (sema_wait(sema_ptr));
            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int *)(pmap[fd].w_w_attr_ptr_offset +
                WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
            if (csz <= w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char *)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char *)cbuf) + w_attr_ptr[SZ];
            }
            w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
                w_mgmt_ptr[CLIENT_ACTIVE_WIN]
                % tparams.nowins;
            /* Post the "occupied" semaphore to wake up the client read thread. */
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_W_O];
            sema_post(sema_ptr);
        }
        ...
    }
Client Side
The read and write functions on the client
side are similar to the read and write functions
on the server side, as shown above.
Configuration Environment Variables
Just as in implementation II, the number and size of slots are configurable through environment variables. Currently, the number of slots is limited to 152. This could be increased for better performance.
To measure and compare the performance of the various
Speed implementations, several tests were carried out
using a publicly available multithreaded client-server
program. The programs are from "Multithreaded Programming
with PThreads"[6]. Some simple modifications were made to the
client.c and server_ms.c code. The O_NDELAY and O_NONBLOCK flags were
set on the socket file descriptor with fcntl, and the TCP_NODELAY socket
option was enabled, to ensure fast response times. TNF instrumentation was added
to get more accurate read and write latencies. The client-server
application was first run without the Speed library being
interposed, in other words, with libsocket.so only, and the
measurements were recorded. The application was then run
with the different implementations of the Speed library
interposed, and the measurements were again recorded. The
results are discussed in detail in the following sections.
Latency Measurements Using TNF Instrumentation
To obtain the latency of the read and write functions,
the client and server were executed separately with ktrace
and prex. The measurements were first recorded without the
Speed library and with libsocket.so only. The Speed library
implementations were interposed and the latencies were again
measured. The hardware used was an E450 with four 400 MHz
CPUs and two gigabytes of memory, running Solaris 8 (with no
updates) in 32-bit mode.
The client and server were run with all CPUs enabled and
with no processor set, and the times were captured for
different message sizes. Then a processor set was created
with two CPUs, with interrupts disabled in the set. The
server was started, and all of its LWPs were bound to this
processor set by using psrset -b [set] serverpid. The TNF
latency data was tabulated and is shown in the tables below.
Average Latency to Send and Receive a Message of 70 Bytes
Latency measurements on the server side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Latency measurements on the Client Side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Average Latency to Send and Receive a Message of 512 Bytes
Server Side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Client Side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Average Latency to Send and Receive a Message of 1000 Bytes
Server Side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Client Side
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Performance Measurements
The Speed library was recompiled without the TNF
instrumentation, and the performance of the client and
server were again measured without interposing the Speed
Library, in other words, with only libsocket.so. The
measurements were again made with the different Speed
libraries interposed. The timings were recorded with the
iobench routine, which is part of the client program. This
routine uses the proc system to get very accurate
measurements.
The client-server application was run with different CPU configurations and in a processor set to measure the optimum performance. The results were then tabulated. Only the best configurations are shown below.
Table Key:
Send and Receive 100,000 Messages of 70 Bytes
Send and Receive 100,000 Messages of 512 Bytes
Send and Receive 100,000 Messages of 1000 Bytes
bcopy Time
The time to bcopy 1000 bytes 100,000 times from a user
buffer to a memory mapped Speed buffer and vice versa, was
measured using the real-time function gethrtime. The
average time of the read and write operation is shown in
the table below. This was measured to estimate the actual
time spent copying data as opposed to the system-time
component, such as the time to context switch, time for sema
operations, and so forth. From this average time, the time
to bcopy 512 bytes and 70 bytes was deduced.
Volano 1.0 Performance
Running Volano Mark 1.0 with the Speed library interposed
improves performance by a factor of five. The Speed library does not
work with the newer Volano 2.X versions, because the poll() call is
not yet supported by the library.
... mmap to copy the data to
overcome the previous limitation. This seems to perform
well, as the doors IPC is used only to make a fast context
switch and to return status.
bcopy time increases with an increase in data size
and approaches user time. The time required to bcopy 1000
byte messages 100,000 times is almost 1.9 seconds because
there are four bcopy operations in a full duplex
operation, which is a client-write to the server and a
client-read from the server. This is about 35% in Speed
Implementation III. Most of the remaining 65% is a constant
overhead of setting up the memory mapped space, connections,
and so forth.
bcopy time approaches user-time as message sizes increase. The
threshold when this happens needs to be measured.
... read and write, approaches the performance
with Speed library interposed. This is due to the system
being able to schedule the requests concurrently and squeeze
the wait-time. But at some threshold, the system will become
a bottleneck, as it may not be able to squeeze more
wait-time. This threshold needs to be studied.
... bcopy times by half. This needs to be
explored.
Interposing the TCP socket library with the Speed library
boosts client-server performance by more than 100%. Speed
Implementations II and III offer significant benefits over
TCP/IP for interprocess communication. TCP/IP does not
perform well with small messages, but its performance improves
as message size increases. Speed Implementation III
outperforms the other implementations, including TCP/IP, on a
per-CPU basis. In fact, the user time approaches bcopy
time with an increase in message size. At the moment, four
bcopies are needed for a successful read and write
operation, as data needs to be copied from the application
buffer to the Speed memory mapped buffer and back. This
can be reduced by half if an API is exposed, allowing the
client and server applications to write directly to the
memory mapped space instead of using read and write
calls. This could offer a further boost to performance for
bigger message sizes.
You can download the source and test data.
We would like to thank Bob Palowoda for his expert advice and Ezhilan Narasimhan for his work on the DoorLet, which initiated the idea for this project. We would also like to thank Rupa Nagendra for helping with the tables and formatting of this document.