Download source code
lockfree.tar.bz2
Source code Highlights
lifo.h
loadTest.cpp
producerConsumer.cpp
What is it ?
The useful code is in lifo.h, and is a fully inlined, assembly
implementation of a lock-free LIFO (last in, first out) stack for
x86 and x86-64 platforms. The other files are just unit tests and
examples.
Lock-free algorithms allows multiple threads to safely store and
retrieve data from a shared data structure without relying on
traditional synchronization primitives (lock/mutexes, critical
sections, semaphore, etc ...)
This lock-free LIFO stack can typically be used to implement
producer-consumer type thread message passing data structure
that are:
A) Very fast
B) Not subject to the priority inversion problem.
This implementation has been kept to a bare minimum (<100 lines
for a given platform) and is a bit different from other lock-free
LIFO implementations you'll find on the internet in that it uses
a DCAS (double compare and swap) primitive for _both_ push() and
pop(). See further down for an explanation.
See here for more on the topic of lock-free data structures:
http://en.wikipedia.org/wiki/Lock_free
How do I use it ?
Below is a barebone example (see producerConsumer.cpp
above for a complete working consumer producer example).
#include <lifo.h>
static Lifo recycled; // Unused messages
static Lifo incoming; // Incoming messages
// Message data structure, inherits Lifo::Node
struct Msg:public Lifo::Node
{
int id;
};
// Consumer thread
static void consumer()
{
while(1)
{
Msg *msg = (Msg*)incoming.pop(); // Try to get a message
if(msg) // If we got a message
{
doSomethingWithMsg(msg); // Process message
recycled.push(msg); // Recycle message
}
}
}
// Producer thread
static void producer()
{
Msg *msg = (Msg*)recycled.pop(); // Get an unused message
if(msg==0) msg = new Msg; // If none make a new one
setMessageContent(msg); // Set message content
incoming.push(msg); // Mail it to consumer
}
Why should I use this ?
Short answer: speed.
Longer answer: if you have a C++ multi-threaded application and have
issues with the speed of message passing between the threads or a
lock contention problem, this code may help.
Traditional message passing data structures tyically use locks (mutexes)
as follows:
- acquire lock
- edit shared data structure
- release lock
In that situation, if many threads try to access the shared data
structure at very high frequency, a lot of contention occurs, and
the threads end up spending most of their time trying to acquire
the lock instead of doing useful work.
With a lock-free data structure, contention is kept to a very bare
minimum: it is reduced to one compare-and-swap operation which may
fail and need to be retried.
How does it work ?
This a relatively standard implementation of the lock-free LIFO
algorithm that relies on the CPU supporting a hardware atomic
DCAS (double-word compare and swap) instruction.
What makes this code a bit different from other implementations
you will find on the net is that _both_ push() and pop() use a
DCAS primitive: this allows you to safely repurpose the "next"
field of the node after it's been popped from the LIFO.
For example, like in the code snippet above, a node can safely
be popped from one shared lock-free LIFO and pushed into another.
Why does it need DCAS you ask ? (i.e. why does it need cmpxchg8b
instead of a simple cpmxchg). The answer is the classic A-B-A
problem. See here for a good explanation (and this implementation
uses the 'tagging' workaround mentioned in the wikipedia page):
http://en.wikipedia.org/wiki/ABA_problem
Try replaying the ABA scenario in your head while brushing your
teeth in the morning: your brain will grow a few new neurons.
Ugh, the code in lifo.h looks really gnarly ...
The code in lifo.h is indeed a bit pre-processor heavy, because
it tries to be fully inlinable and as efficient as possible for
many different cases (x86, x86-64, PIC, non-PIC, etc ...)
It does, however, compile to rather minimalist assembly code.
For example, on x86-64,
This code:
Lifo *lifo,
Lifo::Node *node
)
{
lifo->push(node);
}
compiles to the following assembly:
__Z4pushP4LifoPNS_4NodeE:
pushq %rbx # save rbx
movq %rsi, %rbx # rbx = node
movq (%rdi), %rax # rax = lifo->head
movq 8(%rdi), %rdx # rdx = lifo->sync
0:
movq %rax, (%rbx) # node->next = lifo->head
movq %rdx, %rcx # rcx = lifo->sync
inc %rcx # rcx = lifo->sync+1
lock cmpxchg16b (%rdi) # lifo->content = newContent iff lifo->content unchanged
jnz 0b # try again if missed
popq %rbx # restore rbx
ret
And this code:
Lifo::Node *pop(
Lifo *lifo
)
{
return lifo->pop();
}
compiles to:
__Z3popP4Lifo:
pushq %rbx # save rbx
0:
movq (%rdi), %rax # rax = lifo->head
movq 8(%rdi), %rdx # rdx = lifo->sync
testq %rax , %rax # is lifo->head 0 ?
jz 1f # yes -> bail
movq (%rax), %rbx # rbx = lifo->head->next
leaq 1(%rdx), %rcx # rcx = lifo->sync + 1
lock cmpxchg16b (%rdi) # lifo->head = next iff lifo->content is unchanged
jnz 0b # try again if missed
1:
popq %rbx # restore rbx
ret
How can I convince myself that it works ?
If you want to be a 100% certain, you will need something that
does formal proofs, and ... that's kind of hard. If you want to
walk that path, I would start here:
http://en.wikipedia.org/wiki/Linearizability
However, if all you require is a certain degree of confidence
that this code does indeed work, you can rely on heavy sampling
(i.e. stress testing the lock-free code and verify that it does
perform as advertized).
This is what the code in loadTest.cpp does: it initially pushes
10K nodes in a lock-free LIFO, then forks 16 threads, each of
which randomly pushes and pops nodes to/from the LIFO a large
number (10M each) of times.
Once all the threads are done, the code checks that:
1. The structure of the LIFO is clean.
2. All nodes that were in there initially are accounted for.
3. All nodes are present exactly once.
To do this, just download the tarball above, and compile/run
as follows:
tar jxpvf lockfree.tar.bz2
cd lockfree
make
./loadTest
./producerConsumer
This technique my seem kind of naive but has proven surprisingly efficient
at weeding out bugs in my code as well as various buggy lock-free LIFO
implementations I've found floating on the net. Don't ever trust lock-free
code unless it comes with really paranoid unit tests.
Caveats
1. Lock-free data structures are _very_ tricky, and making sure lock-free
code is working correctly is really, really, really hard (really).
As such, if you plan to modify the code, make sure you know
what you're doing, and make sure your modified code passes
the load test.
2. The code is only works x86 CPUs. For 32bit mode (i386), the CPU must
support the "cmpxchg8b" instruction. For 64bit mode (x86-64), the CPU
must support the "cmpxchg16b" instruction. Most modern (i.e. post 586)
x86 chips do support these instructions.
3. The LIFO code itself is CPU specific, but OS-independent. However you
will need to compile it with GCC because it uses GCC's inline assembly
syntax (for now).
The unit test code needs to be able to fork threads, which unfortunately
makes it less portable, but I have successfully tested it on Linux and Mingw.
4. This code, like _any_ code that relies on ABA tags, is subject to the
highly improbable - yet possible - problem of the ABA tag wrap-around.
This can happen with a probability of probabilityOfCollision * 2^-32 on
x86 and probabilityOfCollision * 2^-64 on x86-64.
That's really, really small, but it is not zero ...
|