Awasu » Extending Python: Managing the GIL
Monday 4th November 2019 4:30 PM

Another common reason for writing some of your code in C/C++ is to get better multi-threaded performance, since Python is known not to handle this well.

Let's simulate some lengthy processing by adding a delay to our C++ code:

#include <Python.h>
#include <unistd.h> // for sleep()

PyObject*
py_add( PyObject* pSelf, PyObject* pArgs )
{
    // extract the arguments
    int val1, val2 ;
    if ( ! PyArg_ParseTuple( pArgs, "ii", &val1, &val2 ) )
        return NULL ;

    // add the numbers
    sleep( 1 ) ; // nb: simulate slow processing
    int result = val1 + val2 ;

    return PyLong_FromLong( result ) ;
}

Here's a test script that kicks off a bunch of worker threads, each one calling our C++ add() function:

import threading

import demo

# ---------------------------------------------------------------------

def worker( val1, val2 ):
    # call the C++ function
    result = demo.add( val1, val2 )
    print( "{} + {} = {}".format( val1, val2, result ) )

# ---------------------------------------------------------------------

# start the worker threads
threads = [
    threading.Thread( target=worker, args=(10,i) )
    for i in range( 1, 5+1 )
]
for t in threads:
    t.start()

# wait for all the threads to finish
for t in threads:
    t.join()

This outputs the correct results, but if we time how long it takes to run, it's apparent that something is wrong: it takes about 5 seconds.

We would expect the script to take about a second to run: it starts 5 worker threads, each of which calls add(), which takes about a second, but since the threads should all be running in parallel, the total should still be only about a second.

However, if you watch the script running, you can see the results being output one at a time, about a second apart, for a total of about 5 seconds, which suggests that the worker threads are being run serially, with each one finishing before the next one starts [1]. Anyone who knows Python well enough to be reading this article will instantly suspect the GIL as the cause of this problem, and they'd be right :|
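The serialization is easy to reproduce without Python at all. The following is a minimal sketch in plain C++ (it does not use the Python C API), assuming a single std::mutex, the hypothetical gFakeGil, as a stand-in for the GIL, and a 100 ms delay instead of a second so that it runs quickly. Each worker holds the lock for the whole of its "slow processing", just as py_add() holds the GIL:

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the GIL: a single process-wide lock.
// (Plain C++ only; this is NOT the Python C API.)
std::mutex gFakeGil ;

// Starts nThreads workers that each hold the stand-in GIL for the whole of
// their "slow processing" (as py_add() does with the real GIL locked),
// waits for them all, and returns the total elapsed wall-clock time.
std::chrono::milliseconds runHoldingLock( int nThreads, std::chrono::milliseconds delay )
{
    auto start = std::chrono::steady_clock::now() ;
    std::vector<std::thread> threads ;
    for ( int i = 0 ; i < nThreads ; ++i )
    {
        threads.emplace_back( [delay]() {
            std::lock_guard<std::mutex> gil( gFakeGil ) ; // held until the lambda returns
            std::this_thread::sleep_for( delay ) ;        // simulate slow processing
        } ) ;
    }
    for ( auto& t : threads )
        t.join() ;
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start ) ;
}
```

With 5 workers and a 100 ms delay, runHoldingLock( 5, std::chrono::milliseconds( 100 ) ) can never take less than 500 ms, because only one worker can hold the lock at any moment, so the delays add up instead of overlapping.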

Managing the GIL

The problem is that when our C++ function is called, the calling thread holds the GIL, and it continues to hold it for the duration of our "lengthy" processing, which prevents Python code in other threads from running.

The solution is to release the GIL, but we need to be careful when we do this: the purpose of the GIL is to stop multiple threads from accessing Python objects at the same time, so we must be sure that we don't touch anything belonging to Python while the GIL is released:

PyObject*
py_add( PyObject* pSelf, PyObject* pArgs )
{
    // extract the arguments
    int val1, val2 ;
    if ( ! PyArg_ParseTuple( pArgs, "ii", &val1, &val2 ) )
        return NULL ;

    // add the numbers
    // NOTE: We no longer need to access any Python objects, so we can release the GIL.
    int result ;
    Py_BEGIN_ALLOW_THREADS // this releases the GIL
    sleep( 1 ) ;
    result = val1 + val2 ;
    Py_END_ALLOW_THREADS // this re-acquires the GIL

    return PyLong_FromLong( result ) ;
}

Running the test script against this new version gives the expected result: it now takes about a second to run.

Also note that the output is "out-of-order", which is to be expected since the order in which threads run, and therefore finish, is unpredictable.
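To see why releasing the lock fixes things, here's a minimal sketch of the same pattern in plain C++ (again, not the Python C API), assuming a std::mutex, the hypothetical gFakeGil, as a stand-in for the GIL, and a global vector, the hypothetical gResults, standing in for Python objects. Each worker takes the lock to read its arguments, releases it for the slow part, then takes it again to publish its result, mirroring the PyArg_ParseTuple(), Py_BEGIN_ALLOW_THREADS, Py_END_ALLOW_THREADS and PyLong_FromLong() steps in the fixed py_add():

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the GIL (plain C++, NOT the Python C API).
std::mutex gFakeGil ;

// Stand-in for Python objects: only ever touched while gFakeGil is held.
std::vector<int> gResults ;

// Starts nThreads workers that mirror the fixed py_add(): read the inputs
// with the lock held, release it for the slow part, then re-take it to
// publish the result. Returns the total elapsed wall-clock time.
std::chrono::milliseconds runReleasingLock( int nThreads, std::chrono::milliseconds delay )
{
    auto start = std::chrono::steady_clock::now() ;
    std::vector<std::thread> threads ;
    for ( int i = 1 ; i <= nThreads ; ++i )
    {
        threads.emplace_back( [i, delay]() {
            std::unique_lock<std::mutex> gil( gFakeGil ) ; // entered with the lock held
            int val1 = 10, val2 = i ;                      // cf. PyArg_ParseTuple()
            gil.unlock() ;                                 // cf. Py_BEGIN_ALLOW_THREADS
            std::this_thread::sleep_for( delay ) ;         // slow work: no shared state touched
            int result = val1 + val2 ;
            gil.lock() ;                                   // cf. Py_END_ALLOW_THREADS
            gResults.push_back( result ) ;                 // cf. PyLong_FromLong()
            // gil's destructor releases the lock when the lambda returns
        } ) ;
    }
    for ( auto& t : threads )
        t.join() ;
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start ) ;
}
```

With 5 workers and a 100 ms delay, this should finish in roughly 100 ms rather than 500 ms, since the sleeps overlap, while gResults is still only ever modified with the lock held. The order of the values in gResults depends on thread scheduling, just like the out-of-order output of the test script.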

RAII'ing the GIL

Any time you have something that needs to be initialized, and then cleaned up afterwards, it's a candidate for an RAII helper class. While it's not strictly necessary, it would be nice to be able to write something like this:

class ReleaseGIL // IMPORTANT: This code won't work!
{
public:
    ReleaseGIL() { Py_BEGIN_ALLOW_THREADS }
    ~ReleaseGIL() { Py_END_ALLOW_THREADS }
} ;

which would let us write our GIL-management code like this:

{
    ReleaseGIL releaseGIL ;
    sleep( 1 ) ; // nb: simulate slow processing
    result = val1 + val2 ;
}

When the code block is entered, a ReleaseGIL object is created, which calls Py_BEGIN_ALLOW_THREADS, and when the code block is exited, the ReleaseGIL object is destroyed, which calls Py_END_ALLOW_THREADS. Importantly, this clean-up will always happen, regardless of how the code block is exited, e.g. by falling off the end of it, through an early return, or because an exception was thrown.

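Unfortunately, the way the Py_BEGIN/END_ALLOW_THREADS macros have been written prevents us from doing this: Py_BEGIN_ALLOW_THREADS opens a new block and declares a hidden local variable (holding the saved thread state) inside it, and Py_END_ALLOW_THREADS uses that variable and then closes the block, so the two macros must appear together in the same scope. Instead, we have to "expand" these macros ourselves, calling the PyEval_SaveThread() and PyEval_RestoreThread() functions that they use, and implement the helper class like this: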

class ReleaseGIL
{
public:
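    ReleaseGIL() { mpThreadState = PyEval_SaveThread() ; } // releases the GIL
    ~ReleaseGIL() { PyEval_RestoreThread( mpThreadState ) ; } // re-acquires the GIL
    // not copyable: a copy would restore the same thread state a second time
    ReleaseGIL( const ReleaseGIL& ) = delete ;
    ReleaseGIL& operator=( const ReleaseGIL& ) = delete ;
private: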
    PyThreadState* mpThreadState ;
} ;
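The claim that the clean-up always happens can be checked with a small, self-contained sketch. It assumes a hypothetical boolean, gHoldingFakeGil, as a stand-in for "this thread holds the GIL", and a hypothetical ReleaseFakeGil class with the same shape as ReleaseGIL; it does not call PyEval_SaveThread() or PyEval_RestoreThread(). The hypothetical slowAdd() leaves the guarded scope in three different ways:

```cpp
#include <stdexcept>

// Hypothetical stand-in for "does this thread hold the GIL?"
// (plain C++; PyEval_SaveThread()/PyEval_RestoreThread() are NOT called here).
// A C extension function is entered with the GIL held, so this starts as true.
bool gHoldingFakeGil = true ;

// Same shape as ReleaseGIL: release in the constructor, re-acquire in the destructor.
class ReleaseFakeGil
{
public:
    ReleaseFakeGil() { gHoldingFakeGil = false ; } // cf. PyEval_SaveThread()
    ~ReleaseFakeGil() { gHoldingFakeGil = true ; } // cf. PyEval_RestoreThread()
    // a copied guard would "re-acquire" twice, so forbid copying
    ReleaseFakeGil( const ReleaseFakeGil& ) = delete ;
    ReleaseFakeGil& operator=( const ReleaseFakeGil& ) = delete ;
} ;

// Leaves the guarded scope in one of three ways, selected by howToExit.
int slowAdd( int val1, int val2, int howToExit )
{
    ReleaseFakeGil releaseGil ;
    if ( howToExit == 1 )
        return -1 ;                                        // early return
    if ( howToExit == 2 )
        throw std::runtime_error( "processing failed" ) ; // exception
    return val1 + val2 ;                                   // normal exit
}
```

After slowAdd( 10, 5, 0 ), after slowAdd( 10, 5, 1 ), and after a call to slowAdd( 10, 5, 2 ) whose exception is caught further up, gHoldingFakeGil is true again in every case, because the guard's destructor runs on every path out of the scope, including stack unwinding.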

I also like to define some helper macros, to make it clearer in the code what's happening:

#define BEGIN_RELEASE_GIL() { ReleaseGIL _releaseGil ;
#define END_RELEASE_GIL() }

This lets me write the GIL-management code like this:

    BEGIN_RELEASE_GIL() {
        sleep( 1 ) ; // nb: simulate slow processing
        result = val1 + val2 ;
    } END_RELEASE_GIL()
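Here's the same idea with the macros, as a sketch using a hypothetical stand-in guard (ReleaseFakeGil, which toggles the hypothetical boolean gHoldingFakeGil) rather than the real Python C API. Because BEGIN_RELEASE_FAKE_GIL() { expands to two opening braces, and } END_RELEASE_FAKE_GIL() to the two matching closing braces, the guard is destroyed, and the stand-in GIL re-acquired, at the END macro:

```cpp
// Hypothetical stand-in state (plain C++, NOT the Python C API):
// true while this thread "holds the GIL".
bool gHoldingFakeGil = true ;

class ReleaseFakeGil
{
public:
    ReleaseFakeGil() { gHoldingFakeGil = false ; } // "release the GIL"
    ~ReleaseFakeGil() { gHoldingFakeGil = true ; } // "re-acquire the GIL"
    ReleaseFakeGil( const ReleaseFakeGil& ) = delete ;
    ReleaseFakeGil& operator=( const ReleaseFakeGil& ) = delete ;
} ;

// Same macro shape as BEGIN_RELEASE_GIL()/END_RELEASE_GIL(), using the stand-in guard.
#define BEGIN_RELEASE_FAKE_GIL() { ReleaseFakeGil _releaseGil ;
#define END_RELEASE_FAKE_GIL() }

// Adds the numbers with the stand-in GIL released, and reports (via
// *pReleasedInside) whether it really was released inside the block.
int addWithGilReleased( int val1, int val2, bool* pReleasedInside )
{
    int result ;
    BEGIN_RELEASE_FAKE_GIL() {
        *pReleasedInside = ! gHoldingFakeGil ;
        result = val1 + val2 ;
    } END_RELEASE_FAKE_GIL()
    // the guard's destructor has run by now, so the stand-in GIL is held again
    return result ;
}
```

Calling addWithGilReleased( 10, 5, &released ) returns 15 with released set to true, and gHoldingFakeGil is true again once the call returns.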
Download the source code here.



1. An experienced developer will also notice that the worker threads are running in the same order that they were created in, which is a dead giveaway.