snippets

Cuckoo
LockFree
CornerPin
GLQuickText
MackeyGlass

Download source code

    lockfree.tar.bz2

Source code Highlights

    lifo.h
    loadTest.cpp
    producerConsumer.cpp

What is it ?

    The useful code is in lifo.h, and is a fully inlined, assembly
    implementation of a lock-free LIFO (last in, first out) stack for
    x86 and x86-64 platforms. The other files are just unit tests and
    examples.

    Lock-free algorithms allows multiple threads to safely store and
    retrieve data from a shared data structure without relying on
    traditional synchronization primitives (lock/mutexes, critical
    sections, semaphore, etc ...)

    This lock-free LIFO stack can typically be used to implement
    producer-consumer type thread message passing data structure
    that are:

        A) Very fast
        B) Not subject to the priority inversion problem.

    This implementation has been kept to a bare minimum (<100 lines
    for a given platform) and is a bit different from other lock-free
    LIFO implementations you'll find on the internet in that it uses
    a DCAS (double compare and swap) primitive for _both_ push() and
    pop(). See further down for an explanation.

    See here for more on the topic of lock-free data structures:

        http://en.wikipedia.org/wiki/Lock_free

How do I use it ?

    Below is a barebone example (see producerConsumer.cpp
    above for a complete working consumer producer example).

        #include <lifo.h>

        static Lifo recycled;   // Unused messages
        static Lifo incoming;   // Incoming messages

        // Message data structure, inherits Lifo::Node
        struct Msg:public Lifo::Node
        {
            int id;
        };

        // Consumer thread
        static void consumer()
        {
            while(1)
            {
                Msg *msg = (Msg*)incoming.pop(); // Try to get a message
                if(msg)                          // If we got a message
                {
                    doSomethingWithMsg(msg);     // Process message
                    recycled.push(msg);          // Recycle message
                }
            }
        }

        // Producer thread
        static void producer()
        {
            Msg *msg = (Msg*)recycled.pop();     // Get an unused message
            if(msg==0) msg = new Msg;            // If none make a new one
            setMessageContent(msg);              // Set message content
            incoming.push(msg);                  // Mail it to consumer
        }
    
Why should I use this ?

    Short answer: speed.

    Longer answer: if you have a C++ multi-threaded application and have
    issues with the speed of message passing between the threads or a
    lock contention problem, this code may help.

    Traditional message passing data structures tyically use locks (mutexes)
    as follows:
        - acquire lock
        - edit shared data structure
        - release lock

    In that situation, if many threads try to access the shared data
    structure at very high frequency, a lot of contention occurs, and
    the threads end up spending most of their time trying to acquire
    the lock instead of doing useful work.

    With a lock-free data structure, contention is kept to a very bare
    minimum: it is reduced to one compare-and-swap operation which may
    fail and need to be retried.

How does it work ?

    This a relatively standard implementation of the lock-free LIFO
    algorithm that relies on the CPU supporting a hardware atomic
    DCAS (double-word compare and swap) instruction.

    What makes this code a bit different from other implementations
    you will find on the net is that _both_ push() and pop() use a
    DCAS primitive: this allows you to safely repurpose the "next"
    field of the node after it's been popped from the LIFO.

    For example, like in the code snippet above, a node can safely
    be popped from one shared lock-free LIFO and pushed into another.
    
    Why does it need DCAS you ask ? (i.e. why does it need cmpxchg8b
    instead of a simple cpmxchg). The answer is the classic A-B-A
    problem. See here for a good explanation (and this implementation
    uses the 'tagging' workaround mentioned in the wikipedia page):

        http://en.wikipedia.org/wiki/ABA_problem

    Try replaying the ABA scenario in your head while brushing your
    teeth in the morning: your brain will grow a few new neurons.

Ugh, the code in lifo.h looks really gnarly ...

    The code in lifo.h is indeed a bit pre-processor heavy, because
    it tries to be fully inlinable and as efficient as possible for
    many different cases (x86, x86-64, PIC, non-PIC, etc ...)

    It does, however, compile to rather minimalist assembly code.

    For example, on x86-64,

        This code:

            Lifo        *lifo,
                Lifo::Node  *node
            )
            {   
                lifo->push(node);
            }

        compiles to the following assembly:

            __Z4pushP4LifoPNS_4NodeE:
                pushq   %rbx            # save rbx
                movq    %rsi,    %rbx   # rbx = node
                movq   (%rdi),   %rax   # rax = lifo->head
                movq  8(%rdi),   %rdx   # rdx = lifo->sync
                0:           
                movq   %rax,    (%rbx)  # node->next = lifo->head 
                movq   %rdx,     %rcx   # rcx = lifo->sync  
                inc    %rcx             # rcx = lifo->sync+1
                lock cmpxchg16b (%rdi)  # lifo->content = newContent iff lifo->content unchanged
                jnz 0b                  # try again if missed
                popq             %rbx   # restore rbx
                ret


        And this code:

            Lifo::Node *pop(
                Lifo *lifo
            )
            {
                return lifo->pop();
            }

        compiles to:

            __Z3popP4Lifo:
                pushq            %rbx     # save rbx
                0:
                movq  (%rdi),    %rax     # rax = lifo->head
                movq 8(%rdi),    %rdx     # rdx = lifo->sync
                testq  %rax ,    %rax     # is lifo->head 0 ?
                jz 1f                     # yes -> bail
                movq  (%rax),    %rbx     # rbx = lifo->head->next
                leaq 1(%rdx),    %rcx     # rcx = lifo->sync + 1
                lock cmpxchg16b (%rdi)    # lifo->head = next iff lifo->content is unchanged
                jnz 0b                    # try again if missed
                1:
                popq            %rbx      # restore rbx
                ret

How can I convince myself that it works ?

    If you want to be a 100% certain, you will need something that
    does formal proofs, and ... that's kind of hard. If you want to
    walk that path, I would start here:

        http://en.wikipedia.org/wiki/Linearizability

    However, if all you require is a certain degree of confidence
    that this code does indeed work, you can rely on heavy sampling
    (i.e.  stress testing the lock-free code and verify that it does
    perform as advertized).

    This is what the code in loadTest.cpp does: it initially pushes
    10K nodes in a lock-free LIFO, then forks 16 threads, each of
    which randomly pushes and pops nodes to/from the LIFO a large
    number (10M each) of times.

    Once all the threads are done, the code checks that:

        1. The structure of the LIFO is clean.
        2. All nodes that were in there initially are accounted for.
        3. All nodes are present exactly once.

    To do this, just download the tarball above, and compile/run
    as follows:

        tar jxpvf lockfree.tar.bz2
        cd lockfree
        make

        ./loadTest
        ./producerConsumer

    This technique my seem kind of naive but has proven surprisingly efficient
    at weeding out bugs in my code as well as various buggy lock-free LIFO
    implementations I've found floating on the net. Don't ever trust lock-free
    code unless it comes with really paranoid unit tests.

Caveats

    1. Lock-free data structures are _very_ tricky, and making sure lock-free
       code is working correctly is really, really, really hard (really).

       As such, if you plan to modify the code, make sure you know
       what you're doing, and make sure your modified code passes
       the load test.

    2. The code is only works x86 CPUs. For 32bit mode (i386), the CPU must
       support the "cmpxchg8b" instruction. For 64bit mode (x86-64), the CPU
       must support the "cmpxchg16b" instruction. Most modern (i.e. post 586)
       x86 chips do support these instructions.

    3. The LIFO code itself is CPU specific, but OS-independent. However you
       will need to compile it with GCC because it uses GCC's inline assembly
       syntax (for now).

       The unit test code needs to be able to fork threads, which unfortunately
       makes it less portable, but I have successfully tested it on Linux and Mingw.

    4. This code, like _any_ code that relies on ABA tags, is subject to the
       highly improbable - yet possible - problem of the ABA tag wrap-around.
       This can happen with a probability of probabilityOfCollision * 2^-32 on
       x86 and probabilityOfCollision * 2^-64 on x86-64.

       That's really, really small, but it is not zero ...













white
Cuckoo
LockFree
CornerPin
GLQuickText
MackeyGlass