Multithreading - Traps and Pittfals

Multithreading is just one damn thing after, before, or simultaneous
with another.    
--Andrei Alexandrescu    


Beyond the errors which can occur in single-threaded programs,
multithreaded environments are subject to additional errors:

- Race conditions
- Deadlock
- Priority failures (priority inversion, starvation, etc...)

Moreover testing and debugging of multithreaded programs are harder.
Multithreaded programs are non-deterministic. Failures are often
non-repeatable. Debugged code can produce very different results then
non-debugged ones. Testing on single processor hardware may produce
different results than testing on multiprocessor hardware.

In this section we show how easy to make mistakes when we are not 
carefull enough. The text-book example is the double-checked lockig
pattern, invented by Douglas C. Schmidt in the Washington University,
St. Luis, and the idea has been appeared in a chapter of the book
Pattern Languages of Program Design 3. published by Addison-Wesley, 1997.


Singleton Pattern

Singleton is a well-known Design Pattern, which ensures that a class 
has only one instance and provides a global point of access to that
instance. For many reasons, dynamically allocated Singletons are very
popular in C++ programs. (One reason: the evaluation order of static 
data declarations is not defined over multiple compilation units, one
other: lazy instantiation: if there is no request for the Singleton
object, we can spare the initialization cost.)

The canonical implementation of the Singleton pattern is the following:

    // in singleton.h:

    class Singleton
    {
    public:
        static Singleton *instance();

        void other_method();
        // other methods ...
    private:
        static Singleton *pinstance;
    };


    // in singleton.cpp:
    
    Singleton *Singleton::pinstance = 0;

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            pinstance = new Singleton;  // lazy initialization
        }

        return pinstance;
    }


And the user simply call other methods on the instance() member function:

    // in client.cpp:

    Singleton::istance()-> other_method();


Thread Safety

It is clear that the canonical implementation of Singleton is not thread
safe. Let consider the following race condition: Thread A and B try to use
Singleton simultaneously on a preemptive parallel environment before its
first initialization.

First thread A evaluates the if condition and consider its value as a
null pointer. Then this thread is suspended.

Now thread B evaluates the condition, reads the null pointer and goes on
to create the instance. The Singleton has been created.

Thread A gets the controll again. It continues to create the Singleton.
Now a second instance has been created.

A critical section is a sequence of instructions that obeys the following
invariant: only one thread/process must execute at a time. The
creation/initialization of the Singleton pattern is a critical section.

The common way to implement a critical section is to add a Lock, like a
Mutex. With a static Mutex, make the Singleton pattern thread safe is easy.

    // in singleton.h:

    class Singleton
    {
    public:
        static Singleton *instance();

        void other_method();
        // other methods ...
    private:
        static Singleton *pinstance;
        static Mutex      lock_;
    };


    // in singleton.cpp:
    
    Singleton *Singleton::pinstance = 0;

    Singleton *Singleton::instance()
    {
        Guard<Mutex> guard(lock_);  // constructor acquires lock_
        
        // this is now the critical section        
    
        if ( 0 == pinstance )
        {
            pinstance = new Singleton;  // lazy initialization
        }

        return pinstance;
    }                               // destructor releases lock_


The ACE::Guard class can use different kind of lock strategies via the
template parameter. Guard uses the RAII (Resource Allocation Is
Initialization) principle: the constructor requires the lock, and the
destructor releases it.


Double-Checked Locking Pattern

However the previous solution now works correctly, it is far from ideal.
The problem is, that although the problematic race condition is connected
only to the initialization of the Singleton instance, the critical section
executed for every calls of the instance() method. All access to the instance
acquire and release the lock. Such an excessive usage of the locking mechanism
may cause serious overhead which could not be acceptable.

One obviously wrong optimalization attempt would be to move the critical
section inside the conditional check of pinstance:

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            Guard<Mutex> guard(lock_);  // constructor acquires lock_
        
            // this is now the critical section        
            
            pinstance = new Singleton;  // lazy initialization
        }                               // destructor releases lock_

        return pinstance;
    }


There is still a race-condition, when each threads read pinstance as null
pointer. Then one thread is able to require the lock, the other has been
blocked. The first thread initializes the instance and then releases the
lock. Now the second thread continous the execution after the condition
check and re-initialize the Singleton.

The solution proposed by Schmidt uses a second check to ensure thread safety.

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            Guard<Mutex> guard(lock_);  // constructor acquires lock_
        
            // this is now the critical section        

            if ( 0 == pinstance )   // re-check pinstance
            {
                pinstance = new Singleton;  // lazy initialization
            }
        }                               // destructor releases lock_

        return pinstance;
    }


Unneccessary locking is avoided by wrapping the call to new with another
conditional test. When in the previous (bad) scenario, the blocked thread
wakes up, it re-checks the state of the pointer, and finds that it is not 
null anymore. The solution minimalizes the lock attempts and still ensures
correctness.

But...


The Problem with DCLP

At the end of its article Schmidt mention a certain disadvantage of the
pattern: There is a subtle portability issue that can lead to pernicious
bugs if the pattern is used in software that have non-atomic pointer or
integral assignment semantics. ... A related problem can occur if an
overly-agressive compiler optimizes (pinstance) by caching it in some
way (e.g. storing it in a register) or by removing the second check of
0 == pinstance.

Consider this part of the DLCP implementation:

    pinstance = new Singleton;


This does three things:

(1) Allocate memory (via operator new) to hold the Singleton object.

(2) Construct the Singleton object in the memory.

(3) Assign to pinstance the address off the memory.

But they need not to be done in this order! For example, the compiler is
free to generate the following code:

    pinstance =                             // (3)
        operator new( sizeof(Singleton) );  // (1)
    new (pinstance) Singleton;              // (2)


If this code is generated, the order is 1, 3, 2 .

In the context:

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            Guard<Mutex> guard(lock_);
        
            // this is now the critical section        

            if ( 0 == pinstance )   // re-check pinstance
            {
                // pinstance = new Singleton;
                pinstance =                             // (3)
                    operator new( sizeof(Singleton) );  // (1)
                new (pinstance) Singleton;              // (2)
            }
        }

        return pinstance;
    }


Consider: thread A executes (1) and (3) and is suspended. Now pinstance
is not null, however the instance has not been constructed. Thread B
checks the outmost condition, sees pinstance not null, and continues to
return the pointer to the non-initialized Singleton object.

The fundamental problem: we need a way to specify instruction ordering
in C++. The sad news is that there is no way for that in C++ (or in C)!
Sequence Points and Observable Behavior

The C++ standard defines so called sequence points. There is a sequence
point at the end of each statement, i.e., "at the semicolon". Also there
is a number of other cases for sequence points, like the shortcut-behaviour
logical operators, the conditional operator, the comma operator and at the
function calls.

When a sequence point is reached, all side effects of previous evaluations
shall be complete and no side effects of subsequent evaluations shall have
taken place.

This suggest that perhaps a careful statement ordering could help to control
the order of the generated instructions. It does not. The compiler may reorder
instruction as long as they preserve the observable behavior of the C++
abstract machine.


Observable behavior consist only of

- Reads and writes of volatile data.
- Calls to library I/O functions.

    void foo()
    {
        int x = 0, y = 0;       // (1)
        x = 5;                  // (2)
        y = 10;                 // (3)
        printf( "%d,%d", x, y); // (4)
    }


In the previous example (1)-(3) may be optimized away, because they have
no observable behavior. If the are executed, (1) must preceed (2)-(4) and
(4) must follow (1)-(3). We know nothing about the relative execution order
of (2) and (3). Either may come first orthey might run simultaneously.
Sequence points offer no control over the execution of statements.


            if ( 0 == pinstance )   // re-check pinstance
            {
                // pinstance = new Singleton;
                Singleton *temp =                    
                    operator new( sizeof(Singleton) );
                new (temp) Singleton;
                pinstance = temp;
            }


Therefore the attempt above is not a solution for the problem. The compiler
will likely optimize out the temp variable. Also other "tricks" will likely
fail: declaring temp as extern , and defining in an other compilation unit.
Declaring the Singleton constructor in a different unit to avoid inlining or
assume that it may throw an exception, etc...


The compiler has a huge motivation for such optimizations:

- Execute operations in parallel where the hardware allows it.
- To avoid spilling data from register.
- To perform common subexpression elimination.
- To keep the instruction pipeline full.
- To reduce the size of the generated executable.


Volatile

Douglas Schmidt proposed volatile to prevent "agressive optimalizations"
in his article. Volatile prevents some compiler optimalizations.

The order of reads/writes of volatile data must be preserved.

Volatile memory may change outside of program control, i.e. some hardware
action, therefore "redundant" reads/writes in source code must not be
eliminated.


This is the C++ meaning of volatile. In Java and in C# volatile is different.
They do add acquire/release semantics to volatile.

Volatile would seem to make the temporary variable strategy viable:

    // in singleton.h:

    class Singleton
    {
    public:
        static Singleton *instance();

        void other_method();
        // other methods ...
    private:
        static volatile Singleton *pinstance;
        static          Mutex      lock_;

        // for example:
        int x;
        Singleton() : x(1)  {}      // test constructor
    };


    // in singleton.cpp:
    
    Singleton *Singleton::pinstance = 0;

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            Guard<Mutex> guard(lock_);
        
            // this is now the critical section        

            if ( 0 == pinstance )   // re-check pinstance
            {
                // these lines can't be reordered
                volatile Singleton *temp = new Singleton;   (1)
                pinstance = temp;                           (2)
            }
        }

        return pinstance;
    }


Indeed, the lines (1) and (2) must not be reordered. This is the code
generated after inlining the constructor:

    Singleton *Singleton::instance()
    {
        if ( 0 == pinstance )
        {
            Guard<Mutex> guard(lock_);
        
            // this is now the critical section        

            if ( 0 == pinstance )   // re-check pinstance
            {
                // these lines can't be reordered
                volatile Singleton *temp = static_cast<Singleton *>
                            (operator new(sizeof(Singleton)));       //  (1)
                temp -> x = 1;                                      //  (2)
                pinstance = temp;                                   //  (3)
            }
        }

        return pinstance;
    }


where (2) is the constructor inlined. Though temp is volatile, *temp is
not, so instructions can be reordered.

            if ( 0 == pinstance )   // re-check pinstance
            {
                // these lines can't be reordered ???
                volatile Singleton *temp = static_cast<Singleton *>
                            (operator new(sizeof(Singleton)));      //  (1)
                pinstance = temp;                                   //  (3)
                temp -> x = 1;                                      //  (2)
            }


And here, pinstance again points to an uninitialized memory.

The main problem is that constraints on the observable behavior apply only
to C++ abstract machine. But the abstract machine was defined as implicitelly
single threaded.


Memory Model

Modern hardware architecture with multiprocessors looks in the following:

             ______________________________________
            |                                      |
            |                                      |
            |                memory                |
            |                                      |
            |______________________________________|
                |       |       |       |       |
                |       |       |       |       |
             cache1  cache2  cache3  cache4  cache5
                |       |       |       |       |
              proc1   proc2   proc3   proc4   proc5



The value of a shared memory location may appear in more then one cache.
They might different values in the different caches. Hardware regularly
takes care of this situations. Different contents of the cache may be
flushed in a different order as they were written. Example: processor A
modifies x variable then modifies y variable. When the cache is flushed,
y is flushed before x. Therefore an other processor may see y value change
before x value change.

Out of order write visibility can cause DCLP to fail. Remember, that
pInstance = new Singleton; does the following:

(1) Allocate memory (via operator new) to hold the Singleton object.
(2) Construct the Singleton object in the memory.
(3) Assign to pinstance the address off the memory.

Now suppose, processor A does these steps in this order. Processor B sees
the result of (3) before the result of (2), therefore goes on to use the
Singleton object as it would be initialized. (It is, but the cache was not      written out yet.)

C++ offers no guaranties that how different threads see the phenomenon above.
Some libraries, like POSIX offers some guaranties.

Java memory model was revised to offer guaranties. Download the latest
documentation from http://www.cs.umd.edu/~pugh/java/memoryModel/jsr133.pdf.

The visibility problems described above can be prevented via the use of
memory barriers. Barriers are instructions that constraint the reads and
writes to allow readers and writers to develop some consistent strategy
on memory usage.

Readers use an acquire barrier to prevent subsequent source code memory
accesses moving before the barrier.

Writers use a release barrier to prevent prior source code memory access
moving after the barrier.

Then one can use a protocol of release/acquire. The release changes the
state of memory and the acquire guaranties that the reader see the new
state. Compilers may not reorder reads and writes that violates memory
barrier semantics. Runtime systems and hardware may also not violate that.

    Singleton *Singleton::instance()
    {
        Singleton *temp = pInstance;    // read pInstance
        
        Perform acquire;    
        // prevent visibility of later memory operations
        // from moving up from this point

        if ( 0 == temp )
        {
            Guard<Mutex> guard(lock_);
        
            // this is now the critical section        

            if ( 0 == pinstance )   // re-check pinstance
            {
                temp = new Singleton;
                
                Perform release;
                // prevent visibility of earlier memory operations
                // from moving down from this point

                pinstance = temp;   // write pInstance
            }
        }

        return pinstance;
    }


Problem: memory barriers are not implemented on every hardware so
the code above is not portable.


Rule for Avoiding Memory Visibility Problems

Shared data should be accessed only if one of the following is true:
- The access is inside a critical section.

For communication via message passing, these protocols are followed:

(1) Read ready flag; perform acquire; read message.
(2) Write message; perform release; write ready flag.

DCLP effectively uses message passing:

pInstance is the ready flag: if it is not null, then "ready".
pInstance itself is the message.


    Singleton *Singleton::instance()
    {
        Singleton *temp = pInstance;    // read flag
        
        Perform acquire;                // acquire

        if ( 0 == temp )
        {
            Guard<Mutex> guard(lock_);
        
            // this is now the critical section        

            if ( 0 == pinstance )
            {
                temp = new Singleton;   // write message
                
                Perform release;        // release

                pinstance = temp;       // write flag
            }
        }

        return pinstance;
    }


This is the correct messaging protocol.
Locks as Barriers

It is critical, that guard is initialized before the second if statement.
Why the following is never happened:

        if ( 0 == temp )
        {
            if ( 0 == pinstance )
            {
                // move lock inside the second conditional statement
                Guard<Mutex> guard(lock_);

                temp = new Singleton;   // write message



Threading libraries ensure that lock acquisition includes the moral
equivalent of an acquire, and lock release includes the equivalent
of a release.

    Singleton *Singleton::instance()
    {
        Singleton *temp = pInstance;    // read flag
        Perform acquire;                // acquire

        if ( 0 == temp )
        {
            Guard<Mutex> guard(lock_);
            Perform acquire;            // constructor of lock
        
            // this is now the critical section        

            if ( 0 == pinstance )
            {
                temp = new Singleton;   // write message
                
                Perform release;        // release
                pinstance = temp;       // write flag
            }
            Perform release;            // destructor of lock
        }

        return pinstance;
    }


Thus locks provide us a portable way to insert memory barriers.

    Singleton *Singleton::instance()
    {
        Singleton *temp = pInstance;    // read flag
        Guard<Mutex> guard1(lock_);     // acquire

        if ( 0 == temp )
        {
            Guard<Mutex> guard(lock_);  // acquire

            if ( 0 == pinstance )
            {
                {
                    Guard<Mutex> guard(lock_);  // acquire
                    temp = new Singleton;       // write message
                }                               // release
                pinstance = temp;       // write flag
            }                           // release    
        }
        return pinstance;
    }                                   // release


Which is the original, bad efficiency code :( .

Solution? There is no portable way to implement DCLP in C++. If the original
version (the instance() function is the critical section) is a bottleneck,
then users are encouraged to cache the pointer to the instance.