A (Bare-Bones) User-Level Thread Library

2016-08-11 tech

This post describes the basic idea behind preemptive scheduling and threads in general. Then, we implement a user-level thread library for Unix-like systems.

For something as essential in computer programming as threads, the idea behind them is actually quite simple. The resources of the computer are not consumed by one thread of execution alone. Instead they are shared between many threads doing different kinds of work. At least that is what it will look like when using high-level thread libraries. What is actually happening is quite different. In reality, a traditional CPU can only run one thread of execution at a time. Now in recent years we saw the arrival of multi-core systems and hyperthreading. These technologies allow us to run more than one thread at the very same time. It is as if there were two computers with two CPUs doing computation on their own. But what if you want to write a program that uses 100 threads? That is way more than typical hardware of today supports.

All threads are managed by a program called scheduler. This scheduler keeps track of all threads and then assigns each of them a time slice. This time slice is an amount of assigned time the thread can use for execution. After the time slice has expired, the scheduler kicks in again. It stops the running thread from executing and assigns a new time slice to new thread. As these time slices are quite short, it appears as if the threads were running in parallel. What is actually happening is different. The running threads are constantly switched out for other ones. This approach is refereed to preemptive threading. This is just like how digital video is really just one static images after another. Our computers don’t run every single program at the same time. It just appears that way.

A similar approach is cooperative threading. Here, the threads have to call a function on their own that yields the current thread for another one. There is no timer that kicks in in regular intervals. While this is easier to implement it has two big disadvantages. First of all it requires a lot of extra work for the programmer to write a program utilizing cooperative threading. You always have to think about when to yield. If you yield too often, your program will perform poorly. If you yield too rarely, other threads get to do no work. Secondly, if a thread enters an infinite loop without ever yielding no other thread will ever run again.

Kernel Level vs. User Level

When you call pthread_create(3) in yor C program what you get is most likely a kernel thread. A kernel thread is a thread managed by the operating system kernel. The scheduler lives in the kernel. Things are different with user threads. There, the scheduler lives in user space. Now this has some interesting implications. Usually programs running in user mode have less privileges than those running in kernel mode. Because of this, user level threads probably aren’t able to fully use the systems hardware.

What Makes a Thread

When a thread is running, it is working off a series of machine instructions. To remember which instruction is currently being executed, there exists the instruction pointer. The instruction pointer points at an address in memory containing the current instruction. This pointer is stored in a special register on the CPU die. There are other registers, too. They store information about the program stack or are just used for computation. When implementing user level threads, we will need to back up all this information. We need to save it when a thread yields execution and later restore when it resumes execution.

Lucky for us, Unix-like systems offer four functions related to the ucontext_t type. Such a ucontext_t structure contains the context of a thread, just what we need to restore it later. The four functions are:

getcontext(3): Backs up the currently running context
setcontext(3): Changes the currently executing context to a backed up one.
makecontext(3): Creates a new context. It modifies an old context so it starts executing at a given function.
swapcontext(3): Swaps out an old context for a new one. At the same time, it also backs up the old one.

To get a good understanding of these functions it is a good idea to read their manual pages. There you should also learn that using them is actually discouraged. You may even run into undefined behavior when using them. I find them to be useful for learning. But when it comes to writing real programs, please do use POSIX threads.

Implementation

Now that all of this is said, let’s get going. We will write a user-level thread library with an interface consisting of just three functions: threads_create creates a new thread, threads_exit exits execution of the current thread and threads_join joins two threads, that is the calling thread waits for another thread to call threads_exit. The implementation is split in three C modules.

tcb Module

To keep track of all the threads we will create a thread control block for each of them. A thread control block contains everything needed to restore the thread to execution (the context). It also stores some other tidbits for bookkeeping.

typedef struct {
    int id;
    ucontext_t context;
    bool has_dynamic_stack;
    void *(*start_routine) (void *);
    void *argument;
    void *return_value;
} TCB;

id is the identifier of the thread later returned by threads_create. context and has_dynamic_stack tells us something about the context of the thread explained later. start_routine is a pointer to a function with one argument of type void * and same return type, argument is the argument for that start_routine. return_value will be used to store the result from start_routine. We need to store it so it can later be obtained through threads_join.

We must ensure that id remains unique. So for creating these structures, will use a function tcb_new instead of just plain allocating memory.

TCB *tcb_new(void)
{
    static int next_id;

    TCB *new;

    if ((new = calloc(1, sizeof(TCB))) == NULL) {
        return NULL;
    }

    new->id = next_id++;
    return new;
}

If there is a function for creating it, we also need one for destroying it. tcb_destroy serves that purpose.

void tcb_destroy(TCB *block)
{
    if (block->has_dynamic_stack) {
        free(block->context.uc_stack.ss_sp);
    }

    free(block);
}

As we will see later, when creating a new thread it needs a stack of its own. It is allocated dynamically using malloc(3). That too will need to be freed when destroying the thread control block (and thus the thread). Hence the line free(block->context.uc_stack.ss_sp).

queue Module

To keep track of these TCB structures, we need some kind of data structure. As this is just a bare-bones implementation, a simple queue is enough. The interface is defined in queue.h, the implementation not very interesting.

typedef struct queue QUEUE;


/* Create a new initialized QUEUE on the heap. Returns a pointer
   to the new block or NULL on error. */
QUEUE *queue_new(void);


/* Destroy queue, freeing all associated memory with it. It also
   frees all memory of the elements inside the queue. */
void queue_destroy(QUEUE *queue);


/* Return the number of items in queue. */
size_t queue_size(const QUEUE *queue);


/* Add elem to the end of queue. Returns 0 on succes and non-zero
   on failure.*/
int queue_enqueue(QUEUE *queue, TCB *elem);


/* Remove the first item from the queue and return it. The caller
   will have to free the reuturned element. Returns NULL if the
   queue is empty. */
TCB *queue_dequeue(QUEUE *queue);


/* Remove element with matching id from the queue. Returns the
   removed element or NULL if no such element exists. */
TCB *queue_remove_id(QUEUE *queue, int id);

threads module

In our threads.c module we create two global queues:

static QUEUE *ready, *completed;

ready contains all threads that are ready for execution. completed contains the threads that have called threads_exit but were not yet joined on.

To keep track of the currently running thread, we also create a global reference to a thread control block.

static TCB *running;

To access the currently running thread, we take a look at running. The threads currently waiting to get a slice of the CPU pie are waiting in ready. As they are these three globals are uninitialized. We will change that later in threads_create.

Setting an Alarm

We will implement preemptive threading using Unix signals. If you have played around with Unix signals you may have used alert(2). After calling it, an alarm is set and then after n seconds the SIGALRM signal is raised. This is almost what we want, but not quite. alert(2) uses real time, it works like an actual alarm clock. But remember that there are many other processes running on your computer managed the by the kernel and its scheduler. It makes no sense to wait n seconds and then switch out the currently running user thread when that user thread hasn’t done any work at all yet. Rather we will use a profiling timer with the help of getitimer(2). It only raises after a certain time has been consumed by the thread.

static bool init_profiling_timer(void)
{
    // Install signal handler

    sigset_t all;
    sigfillset(&all);

    const struct sigaction alarm = {
        .sa_sigaction = handle_sigprof,
        .sa_mask = all,
        .sa_flags = SA_SIGINFO | SA_RESTART
    };

    struct sigaction old;

    if (sigaction(SIGPROF, &alarm, &old) == -1) {
        perror("sigaction");
        abort();
    }

    const struct itimerval timer = {
        { 0, 10000 },
        { 0, 1 }  // arms the timer as soon as possible
    };

    // Enable timer

    if (setitimer(ITIMER_PROF, &timer, NULL) == - 1) {
        if (sigaction(SIGPROF, &old, NULL) == -1) {
            perror("sigaction");
            abort();
        }

        return false;
    }

    return true;
}

After calling this function, every 10000 µs of execution a SIGPROF signal is emitted. The signal is then handled in the handle_sigprof signal handler.

static void handle_sigprof(int signum, siginfo_t *nfo, void *context)
{
    int old_errno = errno;

    // Backup the current context
    
    ucontext_t *stored = &running->context;
    ucontext_t *updated = (ucontext_t *) context;

    stored->uc_flags = updated->uc_flags;
    stored->uc_link = updated->uc_link;
    stored->uc_mcontext = updated->uc_mcontext;
    stored->uc_sigmask = updated->uc_sigmask;

    // Round robin

    if (queue_enqueue(ready, running) != 0) {
        abort();
    }

    if ((running = queue_dequeue(ready)) == NULL) {
        abort();
    }

    // Manually leave the signal handler

    errno = old_errno;
    if (setcontext(&running->context) == -1) {
        abort();
    }
}

What is important to understand here is that signal handlers usually run in a context of their own. It is different from the context of the thread that was running before the program got the signal. If we were to call getcontext(3) we could get the wrong context. Instead we use the last parameter of the signal handler, context. It is a pointer to an ucontext_t structure representing the context before the signal handler was run.

At least that is how it used to be, to quote the man page of getcontext(3) on my Debian machine:

If the context was obtained by a call to a signal handler, then old standard text says that “program execution continues with the program instruction following the instruction interrupted by the signal”. However, this sentence was removed in SUSv2, and the present verdict is “the result is unspecified”.

We can still use this argument in our quest to understand the idea behind threading. Just don’t use it in any production code!

At first the old context is backed up by updating the ucontext_t structure of the running thread. The running thread control block is then put at the end of the running queue. To get the next thread in line, we dequeue a thread control block from the ready queue. This dequeued thread becomes the new running thread. We update running and then call setcontext on the address of running->context. This is all our scheduling logic amounts to. Every thread gets an equal amount of time (they are all of equal priority) and they all work in order. This kind of bare-bones scheduling is usually called round robin scheduling.

Synchronization

After calling init_profiling_timer, every 10000 µs of execution the currently running thread gets interrupted. This is what we want, but if at the same time we modify any of the queues, the queue could get corrupted. Because of this we will block SIGPROF in some functions. Unix offers the sigprocmask function to do this. Using it is a bit bothersome, so let’s write two helper functions.

static void block_sigprof(void)
{
    sigset_t sigprof;
    sigemptyset(&sigprof);
    sigaddset(&sigprof, SIGPROF);

    if (sigprocmask(SIG_BLOCK, &sigprof, NULL) == -1) {
        perror("sigprocmask");
        abort();
    }
}


static void unblock_sigprof(void)
{
    sigset_t sigprof;
    sigemptyset(&sigprof);
    sigaddset(&sigprof, SIGPROF);

    if (sigprocmask(SIG_UNBLOCK, &sigprof, NULL) == -1) {
        perror("sigprocmask");
        abort();
    }
}

Creating a Thread

When threads_create gets called for the first time, the threads module is still uninitialized. There are three things that need to be initialized. First there are the queues ready and completed. A helper function will take care of that.

static bool init_queues(void)
{
    if ((ready = queue_new()) == NULL) {
        return false;
    }

    if ((completed = queue_new()) == NULL) {
        queue_destroy(ready);
        return false;
    }

    return true;
}

Secondly, we need to threadify, for lack of a better word, the original context. When the program first starts execution we can think of it as running in a single thread. But our libray knows nothing about this yet. We write a function init_first_context that changes that. In it, the currently running context is backed up in a thread control block. As indeed this first thread is the thread currently running, the running global is set to it.

static bool init_first_context(void)
{
    TCB *block;

    if ((block = tcb_new()) == NULL) {
        return false;
    }

    if (getcontext(&block->context) == -1) {
        tcb_destroy(block);
        return false;
    }

    running = block;
    return true;
}

The last thing we will initialize is the timer. For this we will use the already discussed function init_profiling_timer.

Now we can implement threads_create. The first third of the function takes care of initialization using the fucntions from earlier.

int threads_create(void *(*start_routine) (void *), void *arg)
{
    block_sigprof();

    // Init if necessary

    static bool initialized;

    if (! initialized) {
        if (! init_queues()) {
            abort();
        }

        if (! init_first_context()) {
            abort();
        }

        if (! init_profiling_timer()) {
            abort();
        }
        
        initialized = true;
    }

Now we need to actually create the new thread, that is the thread control block. Once created, we use getcontext(3) to capture the current context, allocate space used as that threads stack and then use makecontext(3) to change the captured context to fit with the parameters of the new thread.

    // Create a thread control block for the newly created thread.

    TCB *new;

    if ((new = tcb_new()) == NULL) {
        return -1;
    }

    if (getcontext(&new->context) == -1) {
        tcb_destroy(new);
        return -1;
    }

    if (! malloc_stack(new)) {
        tcb_destroy(new);
        return -1;
    }

    makecontext(&new->context, handle_thread_start, 1, new->id);

    new->start_routine = start_routine;
    new->argument = arg;

    // Enqueue the newly created stack

    if (queue_enqueue(ready, new) != 0) {
        tcb_destroy(new);
        return -1;
    }

    unblock_sigprof();
    return new->id;
}

We used two so far not mentioned functions here: malloc_stack and handle_thread_start as a paramter of the makecontext(3) call.

The first function, malloc_stack just allocates memory on the heap later used as a stack and then modifies the context stored in the thread control block to include that stack.

static bool malloc_stack(TCB *thread)
{
    // Get the stack size

    struct rlimit limit;

    if (getrlimit(RLIMIT_STACK, &limit) == -1) {
        return false;
    }

    // Allocate memory

    void *stack;

    if ((stack = malloc(limit.rlim_cur)) == NULL) {
        return false;
    }

    // Update the thread control bock

    thread->context.uc_stack.ss_flags = 0;
    thread->context.uc_stack.ss_size = limit.rlim_cur;
    thread->context.uc_stack.ss_sp = stack;
    thread->has_dynamic_stack = true;

    return true;
}

ss_size is the size of the stack, ss_sp the beginning of the stack and ss_flags set to 0 indicates a new stack. See the manual page for getcontext(3) for details.

The second and more interesting function is handle_thread_start used with makecontext(3). makecontext(3) takes a ucontext_t structure usually obtained through getcontext(3) and modifies it. Once the modified context is activated, execution starts at function func with variable number of int arguments. These arguments were passed to makecontext(3). Again, details are described in the manual page.

You may have noticed that the interface of makecontext(3) doesn’t match our POSIX-inspired interface. makecontext(3) starts in a function with int arguments, but we want void* . For this reason, we need a little wrapper, handle_thread_start.

static void handle_thread_start(void)
{
    block_sigprof();
    TCB *this = running;
    unblock_sigprof();

    void *result = this->start_routine(this->argument);
    threads_exit(result);
}

This wrapper actually does a bit more than that, it also calls threads_exit. This way, every thread created with our library is guaranteed to call threads_exit. Even if the thread only returns.

The only exception here is the first thread which is not run through this wrapper. Remember it was created with init_first_context. In our implementation the first thread is a bit of a special snowflake, when it exits the whole program exits. There are ways around this, but we won’t do that here.

threads_exit is fairly simple. First, it enqueues the currently running thread control block in the completed queue. It then yields the processor to the next thread in line.

void threads_exit(void *result)
{
    if (running == NULL) {
        exit(EXIT_SUCCESS);
    }

    block_sigprof();

    running->return_value = result;

    if (queue_enqueue(completed, running) != 0) {
        abort();
    }

    if ((running = queue_dequeue(ready)) == NULL) {
        exit(EXIT_SUCCESS);
    }

    setcontext(&running->context);  // also unblocks SIGPROF
}

Last thing we need to implement is threads_join.

int threads_join(int id, void **result)
{
    if (id < 0) {
        errno = EINVAL;
        return -1;
    }

    block_sigprof();
    TCB *block = queue_remove_id(completed, id);
    unblock_sigprof();

    if (block == NULL) {
        return 0;
    } else {
        *result = block->return_value;
        tcb_destroy(block);
        return id;
    }
}

All threads_join does is look through the queue completed. If an element with matching id is found, that thread control block is destroyed for good. The return value is written into result.

Full Code

The source code is now available on Github at github.com/kissen/threads.