A (Bare-Bones) User-Level Thread Library
techThis post describes the basic idea behind preemptive scheduling and threads in general. Then, we implement a user-level thread library for Unix-like systems.
For something as essential in computer programming as threads, the idea behind them is actually quite simple. The resources of the computer are not consumed by one thread of execution alone. Instead they are shared between many threads doing different kinds of work. At least that is what it will look like when using high-level thread libraries. What is actually happening is quite different. In reality, a traditional CPU can only run one thread of execution at a time. Now in recent years we saw saw the arrival of multi-core systems and hyperthreading. These technologies allow us to run more than one thread at the very same time. It is as if there were two computers with two CPUs doing computation on their own. But what if you want to write a program that uses 100 threads? That is way more than typical hardware of today supports.
All threads are managed by a program called scheduler. This scheduler keeps track of all threads and then assigns each of them a time slice. This time slice is an amount of assigned time the thread can use for execution. After the time slice has expired, the scheduler kicks in again. It stops the running thread from executing and assigns a new time slice to new thread. As these time slices are quite short, it appears as if the threads were running in parallel. What is actually happening is different. The running threads are constantly switched out for other ones. This approach is refereed to preemptive threading. This is just like how digital video is really just one static images after another. Our computers don’t run every single program at the same time. It just appears that way.
A similar approach is cooperative threading. Here, the threads have to call a function on their own that yields the current thread for another one. There is no timer that kicks in in regular intervals. While this is easier to implement it has two big disadvantages. First of all it requires a lot of extra work for the programmer to write a program utilizing cooperative threading. You always have to think about when to yield. If you yield too often, your program will perform poorly. If you yield too rarely, other threads get to do no work. Secondly, if a thread enters an infinite loop without ever yielding no other thread will ever run again.
Kernel Level vs. User Level
When you call pthread_create(3)
in yor C program what you get is
most likely a kernel thread. A kernel thread is a thread managed by
the operating system kernel. The scheduler lives in the kernel. Things
are different with user threads. There, the scheduler lives in user
space. Now this has some interesting implications. Usually programs
running in user mode have less privileges than those running in kernel
mode. Because of this, user level threads probably aren’t able to
fully use the systems hardware.
What Makes a Thread
When a thread is running, it is working off a series of machine instructions. To remember which instruction is currently being executed, there exists the instruction pointer. The instruction pointer points at an address in memory containing the current instruction. This pointer is stored in a special register on the CPU die. There are other registers, too. They store information about the program stack or are just used for computation. When implementing user level threads, we will need to back up all this information. We need to save it when a thread yields execution and later restore when it resumes execution.
Lucky for us, Unix-like systems offer four functions related to the
ucontext_t
type. Such a ucontext_t
structure contains the context
of a thread, just what we need to restore it later. The four functions
are:
getcontext(3)
: Backs up the currently running contextsetcontext(3)
: Changes the currently executing context to a backed up one.makecontext(3)
: Creates a new context. It modifies an old context so it starts executing at a given function.swapcontext(3)
: Swaps out an old context for a new one. At the same time, it also backs up the old one.
To get a good understanding of these functions it is a good idea to read their manual pages. There you should also learn that using them is actually discouraged. You may even run into undefined behavior when using them. I find them to be useful for learning. But when it comes to writing real programs, please do use POSIX threads.
Implementation
Now that all of this is said, let’s get going. We will write a
user-level thread library with an interface consisting of just three
functions: threads_create
creates a new thread, threads_exit
exits
execution of the current thread and threads_join
joins two threads,
that is the calling thread waits for another thread to call
threads_exit
. The implementation is split in three C modules.
tcb Module
To keep track of all the threads we will create a thread control block for each of them. A thread control block contains everything needed to restore the thread to execution (the context). It also stores some other tidbits for bookkeeping.
typedef struct {
int id;
ucontext_t context;
bool has_dynamic_stack;
void *(*start_routine) (void *);
void *argument;
void *return_value;
} TCB;
id
is the identifier of the thread later returned by
threads_create
. context
and has_dynamic_stack
tells us something
about the context of the thread explained later. start_routine
is a
pointer to a function with one argument of type void *
and same
return type, argument
is the argument for that
start_routine
. return_value
will be used to store the result from
start_routine
. We need to store it so it can later be obtained
through threads_join
.
We must ensure that id
remains unique. So for creating these
structures, will use a function tcb_new
instead of just plain
allocating memory.
TCB *tcb_new(void)
{
static int next_id;
TCB *new;
if ((new = calloc(1, sizeof(TCB))) == NULL) {
return NULL;
}
new->id = next_id++;
return new;
}
If there is a function for creating it, we also need one for
destroying it. tcb_destroy
serves that purpose.
void tcb_destroy(TCB *block)
{
if (block->has_dynamic_stack) {
free(block->context.uc_stack.ss_sp);
}
free(block);
}
As we will see later, when creating a new thread it needs a stack of
its own. It is allocated dynamically using malloc(3)
. That too will
need to be freed when destroying the thread control block (and thus
the thread). Hence the line free(block->context.uc_stack.ss_sp)
.
queue Module
To keep track of these TCB
structures, we need some kind of data
structure. As this is just a bare-bones implementation, a simple queue
is enough. The interface is defined in queue.h
, the implementation
not very interesting.
typedef struct queue QUEUE;
/* Create a new initialized QUEUE on the heap. Returns a pointer
to the new block or NULL on error. */
QUEUE *queue_new(void);
/* Destroy queue, freeing all associated memory with it. It also
frees all memory of the elements inside the queue. */
void queue_destroy(QUEUE *queue);
/* Return the number of items in queue. */
size_t queue_size(const QUEUE *queue);
/* Add elem to the end of queue. Returns 0 on succes and non-zero
on failure.*/
int queue_enqueue(QUEUE *queue, TCB *elem);
/* Remove the first item from the queue and return it. The caller
will have to free the reuturned element. Returns NULL if the
queue is empty. */
TCB *queue_dequeue(QUEUE *queue);
/* Remove element with matching id from the queue. Returns the
removed element or NULL if no such element exists. */
TCB *queue_remove_id(QUEUE *queue, int id);
threads module
In our threads.c
module we create two global queues:
static QUEUE *ready, *completed;
ready
contains all threads that are ready for execution. completed
contains the threads that have called threads_exit
but were not yet
joined on.
To keep track of the currently running thread, we also create a global reference to a thread control block.
static TCB *running;
To access the currently running thread, we take a look at running
.
The threads currently waiting to get a slice of the CPU pie are
waiting in ready
. As they are these three globals are uninitialized.
We will change that later in threads_create
.
Setting an Alarm
We will implement preemptive threading using Unix signals. If you have
played around with Unix signals you may have used alert(2)
. After
calling it, an alarm is set and then after n seconds the SIGALRM
signal is raised. This is almost what we want, but not
quite. alert(2)
uses real time, it works like an actual alarm
clock. But remember that there are many other processes running on
your computer managed the by the kernel and its scheduler. It makes no
sense to wait n seconds and then switch out the currently running
user thread when that user thread hasn’t done any work at all yet.
Rather we will use a profiling timer with the help of
getitimer(2)
. It only raises after a certain time has been
consumed by the thread.
static bool init_profiling_timer(void)
{
// Install signal handler
sigset_t all;
sigfillset(&all);
const struct sigaction alarm = {
.sa_sigaction = handle_sigprof,
.sa_mask = all,
.sa_flags = SA_SIGINFO | SA_RESTART
};
struct sigaction old;
if (sigaction(SIGPROF, &alarm, &old) == -1) {
perror("sigaction");
abort();
}
const struct itimerval timer = {
{ 0, 10000 },
{ 0, 1 } // arms the timer as soon as possible
};
// Enable timer
if (setitimer(ITIMER_PROF, &timer, NULL) == - 1) {
if (sigaction(SIGPROF, &old, NULL) == -1) {
perror("sigaction");
abort();
}
return false;
}
return true;
}
After calling this function, every 10000 µs of execution a SIGPROF
signal is emitted. The signal is then handled in the handle_sigprof
signal handler.
static void handle_sigprof(int signum, siginfo_t *nfo, void *context)
{
int old_errno = errno;
// Backup the current context
ucontext_t *stored = &running->context;
ucontext_t *updated = (ucontext_t *) context;
stored->uc_flags = updated->uc_flags;
stored->uc_link = updated->uc_link;
stored->uc_mcontext = updated->uc_mcontext;
stored->uc_sigmask = updated->uc_sigmask;
// Round robin
if (queue_enqueue(ready, running) != 0) {
abort();
}
if ((running = queue_dequeue(ready)) == NULL) {
abort();
}
// Manually leave the signal handler
errno = old_errno;
if (setcontext(&running->context) == -1) {
abort();
}
}
What is important to understand here is that signal handlers usually
run in a context of their own. It is different from the context of the
thread that was running before the program got the signal. If we were
to call getcontext(3)
we could get the wrong context. Instead we use
the last parameter of the signal handler, context
. It is a pointer
to an ucontext_t
structure representing the context before the
signal handler was run.
At least that is how it used to be, to quote the man page of
getcontext(3)
on my Debian machine:
If the context was obtained by a call to a signal handler, then old standard text says that “program execution continues with the program instruction following the instruction interrupted by the signal”. However, this sentence was removed in SUSv2, and the present verdict is “the result is unspecified”.
We can still use this argument in our quest to understand the idea behind threading. Just don’t use it in any production code!
At first the old context is backed up by updating the ucontext_t
structure of the running thread. The running thread control block is
then put at the end of the running queue. To get the next thread in
line, we dequeue a thread control block from the ready
queue. This
dequeued thread becomes the new running thread. We update running
and then call setcontext
on the address of running->context
. This
is all our scheduling logic amounts to. Every thread gets an equal
amount of time (they are all of equal priority) and they all work in
order. This kind of bare-bones scheduling is usually called round
robin scheduling.
Synchronization
After calling init_profiling_timer
, every 10000 µs of execution the
currently running thread gets interrupted. This is what we want, but
if at the same time we modify any of the queues, the queue could get
corrupted. Because of this we will block SIGPROF
in some
functions. Unix offers the sigprocmask
function to do this. Using it
is a bit bothersome, so let’s write two helper functions.
static void block_sigprof(void)
{
sigset_t sigprof;
sigemptyset(&sigprof);
sigaddset(&sigprof, SIGPROF);
if (sigprocmask(SIG_BLOCK, &sigprof, NULL) == -1) {
perror("sigprocmask");
abort();
}
}
static void unblock_sigprof(void)
{
sigset_t sigprof;
sigemptyset(&sigprof);
sigaddset(&sigprof, SIGPROF);
if (sigprocmask(SIG_UNBLOCK, &sigprof, NULL) == -1) {
perror("sigprocmask");
abort();
}
}
Creating a Thread
When threads_create
gets called for the first time, the threads
module is still uninitialized. There are three things that need to be
initialized. First there are the queues ready
and completed
. A
helper function will take care of that.
static bool init_queues(void)
{
if ((ready = queue_new()) == NULL) {
return false;
}
if ((completed = queue_new()) == NULL) {
queue_destroy(ready);
return false;
}
return true;
}
Secondly, we need to threadify, for lack of a better word, the
original context. When the program first starts execution we can think
of it as running in a single thread. But our libray knows nothing
about this yet. We write a function init_first_context
that changes
that. In it, the currently running context is backed up in a thread
control block. As indeed this first thread is the thread currently
running, the running
global is set to it.
static bool init_first_context(void)
{
TCB *block;
if ((block = tcb_new()) == NULL) {
return false;
}
if (getcontext(&block->context) == -1) {
tcb_destroy(block);
return false;
}
running = block;
return true;
}
The last thing we will initialize is the timer. For this we will use
the already discussed function init_profiling_timer
.
Now we can implement threads_create
. The first third of the function
takes care of initialization using the fucntions from earlier.
int threads_create(void *(*start_routine) (void *), void *arg)
{
block_sigprof();
// Init if necessary
static bool initialized;
if (! initialized) {
if (! init_queues()) {
abort();
}
if (! init_first_context()) {
abort();
}
if (! init_profiling_timer()) {
abort();
}
initialized = true;
}
Now we need to actually create the new thread, that is the thread
control block. Once created, we use getcontext(3)
to capture the
current context, allocate space used as that threads stack and then
use makecontext(3)
to change the captured context to fit with the
parameters of the new thread.
// Create a thread control block for the newly created thread.
TCB *new;
if ((new = tcb_new()) == NULL) {
return -1;
}
if (getcontext(&new->context) == -1) {
tcb_destroy(new);
return -1;
}
if (! malloc_stack(new)) {
tcb_destroy(new);
return -1;
}
makecontext(&new->context, handle_thread_start, 1, new->id);
new->start_routine = start_routine;
new->argument = arg;
// Enqueue the newly created stack
if (queue_enqueue(ready, new) != 0) {
tcb_destroy(new);
return -1;
}
unblock_sigprof();
return new->id;
}
We used two so far not mentioned functions here: malloc_stack
and
handle_thread_start
as a paramter of the makecontext(3)
call.
The first function, malloc_stack
just allocates memory on the heap
later used as a stack and then modifies the context stored in the
thread control block to include that stack.
static bool malloc_stack(TCB *thread)
{
// Get the stack size
struct rlimit limit;
if (getrlimit(RLIMIT_STACK, &limit) == -1) {
return false;
}
// Allocate memory
void *stack;
if ((stack = malloc(limit.rlim_cur)) == NULL) {
return false;
}
// Update the thread control bock
thread->context.uc_stack.ss_flags = 0;
thread->context.uc_stack.ss_size = limit.rlim_cur;
thread->context.uc_stack.ss_sp = stack;
thread->has_dynamic_stack = true;
return true;
}
ss_size
is the size of the stack, ss_sp
the beginning of the stack
and ss_flags
set to 0
indicates a new stack. See the manual page
for getcontext(3)
for details.
The second and more interesting function is handle_thread_start
used
with makecontext(3)
. makecontext(3)
takes a ucontext_t
structure
usually obtained through getcontext(3)
and modifies it. Once the
modified context is activated, execution starts at function func
with variable number of int
arguments. These arguments were passed
to makecontext(3)
. Again, details are described in the manual page.
You may have noticed that the interface of makecontext(3)
doesn’t
match our POSIX-inspired interface. makecontext(3)
starts in a
function with int
arguments, but we want void*
. For this reason,
we need a little wrapper, handle_thread_start
.
static void handle_thread_start(void)
{
block_sigprof();
TCB *this = running;
unblock_sigprof();
void *result = this->start_routine(this->argument);
threads_exit(result);
}
This wrapper actually does a bit more than that, it also calls
threads_exit
. This way, every thread created with our library is
guaranteed to call threads_exit
. Even if the thread only returns.
The only exception here is the first thread which is not run through
this wrapper. Remember it was created with init_first_context
. In
our implementation the first thread is a bit of a special snowflake,
when it exits the whole program exits. There are ways around this, but
we won’t do that here.
threads_exit
is fairly simple. First, it enqueues the currently
running thread control block in the completed
queue. It then yields
the processor to the next thread in line.
void threads_exit(void *result)
{
if (running == NULL) {
exit(EXIT_SUCCESS);
}
block_sigprof();
running->return_value = result;
if (queue_enqueue(completed, running) != 0) {
abort();
}
if ((running = queue_dequeue(ready)) == NULL) {
exit(EXIT_SUCCESS);
}
setcontext(&running->context); // also unblocks SIGPROF
}
Last thing we need to implement is threads_join
.
int threads_join(int id, void **result)
{
if (id < 0) {
errno = EINVAL;
return -1;
}
block_sigprof();
TCB *block = queue_remove_id(completed, id);
unblock_sigprof();
if (block == NULL) {
return 0;
} else {
*result = block->return_value;
tcb_destroy(block);
return id;
}
}
All threads_join
does is look through the queue completed
. If an
element with matching id
is found, that thread control block is
destroyed for good. The return value is written into result
.
Full Code
The source code is now available on Github at github.com/kissen/threads.