The Pthread Workqueue Regulator is meant to help userland regulate thread pools based on the actual number of threads that are running, the capacity of the machine, the number of blocked threads …

kernel-land design

In the kernel, threads registered in the pwq regulator can be in 5 states:

blocked

This is the state of threads that are currently blocked in a syscall.

running

This is the state of threads that are either really running, or have been preempted out by the kernel. In other words it’s the number of schedulable threads.

waiting

This is the state of threads that are currently in a PWQR_CTL_WAIT call from userspace (see pwqr_ctl) but that would not overcommit if released by a PWQR_CTL_WAKE call.

quarantined

This is the state of threads that are currently in a PWQR_CTL_WAIT call from userspace (see pwqr_ctl) but that would overcommit if released by a PWQR_CTL_WAKE call.

This state avoids waking a thread only to force userland to "park" it, which is racy and makes the scheduler work for nothing useful. However, if PWQR_CTL_WAKE is called, quarantined threads are woken with errno set to EDQUOT, and only one at a time, no matter how many wakes were requested.

This state actually has only one impact: when PWQR_CTL_WAKE is called for more than one thread, for example 4, and userland knows that there are 5 threads in WAIT state, but 3 of them are actually in quarantine, only 2 will be woken up, and the PWQR_CTL_WAKE call will return 2. Any subsequent PWQR_CTL_WAKE call will wake up one quarantined thread to let it be parked, but will return 0 each time to hide that from userland.
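The accounting in this example can be modelled in a few lines of C. This is only a sketch of the behaviour as described here, not the kernel code: `pool_model` and `model_wake` are hypothetical names, and the model assumes that a call which finds releasable waiting threads wakes only those.

```c
#include <assert.h>

/* Hypothetical userland model of the two WAIT-side counters. */
struct pool_model {
    int waiting;      /* threads in WAIT that may be released */
    int quarantined;  /* threads in WAIT that would overcommit */
};

/*
 * Model of the value PWQR_CTL_WAKE returns for a request of `val` threads.
 * Assumption: waiting threads are served first; once none are left, each
 * call releases exactly one quarantined thread and reports 0.
 */
int model_wake(struct pool_model *p, int val)
{
    assert(val > 0);
    if (p->waiting > 0) {
        int woken = val < p->waiting ? val : p->waiting;
        p->waiting -= woken;
        return woken;
    }
    if (p->quarantined > 0)
        p->quarantined--;   /* woken with EDQUOT so that it can park */
    return 0;
}
```

With 2 waiting and 3 quarantined threads, a request for 4 reports 2, and each later request reports 0 while draining one quarantined thread.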

parked

This is the state of threads currently in a PWQR_CTL_PARK call from userspace (see pwqr_ctl).

The regulator tries to maintain the following invariant:

    running + waiting == target_concurrency
||  (running + waiting < target_concurrency && waiting > 0)

When running + waiting overcommits

The kernel puts waiting threads into the quarantine, which doesn’t require anything from userland. It’s something userland discovers only when it needs a waiting thread, which may never happen.

If there are no waiting threads, then well, the workqueue overcommits, and the pwqr file descriptor is made available for reading. When userland reads, it gets the amount of overcommit at the time of the read(2) call.

Note
the descriptor only becomes readable after a delay (roughly 0.05s in the current implementation). Once you have read it, a new notification can fire after another 0.05s, and so on, which allows a slow decrease of the number of threads.

When running + waiting undercommits

If waiting is non-zero then, well, we don’t care: it means userland actually has no work to perform.

If waiting is zero, then a parked thread (if there is one) is woken up so that userland has a chance to consume jobs.

Unparking threads only when waiting becomes zero avoids flip-flops when the job flow is small and some of the running threads sometimes block (IOW running sometimes decreases, making running + waiting fall below the target concurrency for a very short time).

Note
unparking only happens after a delay (0.1s in the current implementation) during which waiting must have remained zero and the overall load must have been undercommitting resources for the whole period.
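The invariant and the two corrective paths (overcommit and undercommit) can be summarised as a small decision function. This is a sketch under stated assumptions, not the kernel implementation: the struct, the enum and both function names are hypothetical, and the 0.05s and 0.1s delays are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the counters the regulator looks at. */
struct pwqr_counts {
    int running;
    int waiting;
    int parked;
    int target_concurrency;
};

enum pwqr_action {
    ACT_NONE,        /* invariant holds, or userland has no work */
    ACT_QUARANTINE,  /* overcommit: move a waiting thread to quarantine */
    ACT_NOTIFY,      /* overcommit, nobody waiting: make the fd readable */
    ACT_UNPARK       /* undercommit, nobody waiting: wake a parked thread */
};

/* The invariant the regulator tries to maintain. */
bool invariant_holds(const struct pwqr_counts *c)
{
    int active = c->running + c->waiting;
    return active == c->target_concurrency
        || (active < c->target_concurrency && c->waiting > 0);
}

/* Which corrective step applies; the delays are ignored in this model. */
enum pwqr_action next_action(const struct pwqr_counts *c)
{
    int active = c->running + c->waiting;

    assert(c->target_concurrency > 0);
    if (active > c->target_concurrency)
        return c->waiting > 0 ? ACT_QUARANTINE : ACT_NOTIFY;
    if (active < c->target_concurrency && c->waiting == 0 && c->parked > 0)
        return ACT_UNPARK;
    return ACT_NONE;
}
```

Note that an undercommitting pool with waiting threads yields ACT_NONE: this is the "we don’t care" case above.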

The regulation between running and waiting threads is left to userspace, which is a far better judge than kernel land, which has absolutely no knowledge of the current workload. Also, doing so means that when there are lots of jobs to process and the pool has a size that doesn’t require more regulation, the kernel isn’t called for mediation/regulation AT ALL.

Todos

When we’re overcommitting for a "long" time, userspace should be notified in some way that it should try to reduce its number of running threads. Note that the Apple implementation (before Lion at least) has the same issue. If you imagine someone who spawns a zillion jobs that call very slow msync()s or blocking read()s over the network, and then all of those go back to the running state, the overcommit is huge.

There are several ways to "fix" this:

in kernel (poll solution, actually implemented right now)

Let the file descriptor be pollable, and let it be readable (returning something like the amount of overcommit at read() time for example) so that userland is notified that it should try to reduce the amount of runnable threads.

It sounds very easy, but it has one major drawback: it means the pwqrfd must somehow be registered into the event loop, and that’s not very suitable for a pthread_workqueue implementation. In other words, if you can plug into the event loop because it’s a custom one, or one that provides thread regulation, then it’s fine. If you can’t (glib, libdispatch, …) then you need a thread that will basically just poll() on this file descriptor, which is really wasteful.

Note
this has been implemented now, but it still looks "expensive" to hook for some users. So if some alternative way to be signalled could exist, it’d be really awesome.

in userspace

Userspace knows how many "running" threads there are: it’s easy to track the number of registered threads, and parked/waiting threads are already accounted for. When "waiting" is zero, if "registered - parked" is "High", userspace could choose to randomly try to park one thread.

Userspace can use a non-blocking read() to probe whether it’s overcommitting. The probing works in one of three modes: NONE, SLOW and AGGRESSIVE.

It’s in NONE when userspace believes it’s not necessary to probe (e.g. when the number of running + waiting threads isn’t that large, say less than 110% of the concurrency, or any similar rule).

Otherwise it’s in SLOW mode. In SLOW mode each thread does a probe every 32 or 64 jobs to mitigate the cost of the syscall. If the probe returns 1, ask for down-committing and stay in SLOW mode. If it fails with EAGAIN, all is fine. If it returns more than 1, ask for down-committing and go to AGGRESSIVE.

When in AGGRESSIVE mode, threads check whether they must park more often and in a more controlled fashion (every 32 or 64 jobs isn’t nice because jobs can be very long), for example based on some poor man’s timer (clock_gettime(CLOCK_MONOTONIC) sounds fine). State transitions work as for SLOW.
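The mode transitions above fit in one small function. The following is a sketch, not code from any existing implementation: all names are hypothetical, the 110% rule is just the example threshold given above, and it assumes that a probe returning exactly 1 while in AGGRESSIVE mode drops back to SLOW (the text only says transitions "work as for SLOW").

```c
#include <assert.h>

enum probe_mode { PROBE_NONE, PROBE_SLOW, PROBE_AGGRESSIVE };

/* Result of one non-blocking read() probe on the pwqr descriptor. */
struct probe_result {
    int would_block;   /* non-zero when read() failed with EAGAIN */
    int overcommit;    /* value read otherwise */
};

/* Example entry rule: probe only at or above 110% of the concurrency. */
enum probe_mode initial_mode(int running_plus_waiting, int concurrency)
{
    return running_plus_waiting * 100 < concurrency * 110
        ? PROBE_NONE : PROBE_SLOW;
}

/*
 * One transition of the SLOW/AGGRESSIVE machine. Sets *park to 1 when the
 * calling thread should ask for down-committing.
 */
enum probe_mode probe_step(enum probe_mode mode, struct probe_result r,
                           int *park)
{
    assert(mode != PROBE_NONE);   /* NONE never probes */
    *park = 0;
    if (r.would_block)
        return mode;              /* all is fine, keep the current mode */
    *park = 1;
    return r.overcommit > 1 ? PROBE_AGGRESSIVE : PROBE_SLOW;
}
```

Only the cadence of the probes differs between the two modes (job count in SLOW, a timer in AGGRESSIVE), which is why a single transition function is enough here.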

The issue I have with this is that it seems to add quite a lot of code to the fast path, hence I dislike it a lot.

my dream

To be able to define a new signal that we could asynchronously send to the process. The signal handler would just set some global flag to 1. The threads, in turn, would check this flag in their job-consuming loop, and the first thread that sees it set would xchg() it with 0 and go to PARK mode if it got the 1. It’s fast and inexpensive.

Sadly, AFAICT, defining new signals isn’t such a good idea. Another possibility is to give an address for the flag at pwqr_create() time and let the kernel write directly into userland memory. The problem is, I feel like it’s a very wrong interface somehow. I should ask some kernel hacker whether that would be really frowned upon. If not, then that’s the leanest solution of all.
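Whoever ends up writing the flag (a signal handler or the kernel), the userland side of this handshake could look like the sketch below, using C11 atomics for the xchg(). The names `park_requested`, `request_park` and `should_park` are hypothetical. None of this exists in the current interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set to 1 by the hypothetical notifier (signal handler or kernel write). */
atomic_int park_requested;

/* What the notifier would do. */
void request_park(void)
{
    atomic_store(&park_requested, 1);
}

/*
 * Called by each worker in its job-consuming loop. Returns true for exactly
 * one thread per request: the one whose exchange actually fetched the 1.
 */
bool should_park(void)
{
    /* Cheap read first, so the common path does not write to the flag. */
    if (atomic_load_explicit(&park_requested, memory_order_relaxed) == 0)
        return false;
    return atomic_exchange(&park_requested, 0) == 1;
}
```

A worker would call should_park() between jobs and issue PWQR_CTL_PARK when it returns true.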

pwqr_create

SYNOPSIS

int pwqr_create(int flags);

DESCRIPTION

This call returns a new PWQR file-descriptor. The regulator is initialized with a concurrency corresponding to the number of online CPUs at the time of the call, as would be returned by sysconf(_SC_NPROCESSORS_ONLN).

flags

a mask of flags among PWQR_FL_CLOEXEC and PWQR_FL_NONBLOCK.

Available operations on the pwqr file descriptor are:

poll, epoll and friends

the PWQR file descriptor can be watched for POLLIN events (not POLLOUT ones, as it cannot be written to).

read

The returned file descriptor can be read from. The read blocks (or fails with EAGAIN in non-blocking mode) until the regulator believes the pool is overcommitting. The buffer passed to read(2) should be able to hold an integer. When read(2) is successful, it stores in that buffer the number of overcommitting threads (that is, the number of threads to park so that the pool no longer overcommits).
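A userland probe built on this read could be wrapped as follows. This is a minimal sketch: `read_overcommit` is a hypothetical helper, it assumes the descriptor was created with PWQR_FL_NONBLOCK and that the value is a plain int, and treating a short read as EIO is a choice made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Returns the overcommit count read from fd, 0 when the read would block
 * (nothing to report), or -1 with errno set on any other failure.
 */
int read_overcommit(int fd)
{
    int count;
    ssize_t n = read(fd, &count, sizeof(count));

    if (n == (ssize_t)sizeof(count))
        return count;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    if (n >= 0)
        errno = EIO;   /* short read: not a full integer */
    return -1;
}
```

A positive return value is the number of threads the caller should try to park.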

RETURN VALUE

On success, this call returns a nonnegative file descriptor. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

[EINVAL]

Invalid value specified in flags

[ENFILE]

The system limit on the total number of open files has been reached.

[ENOMEM]

There was insufficient memory to create the kernel object.

pwqr_ctl

SYNOPSIS

int pwqr_ctl(int pwqrfd, int op, int val, void *addr);

DESCRIPTION

This system call performs control operations on the pwqr instance referred to by the file descriptor pwqrfd.

Valid values for the op argument are:

PWQR_CTL_GET_CONC

Requests the current concurrency level for this regulator.

PWQR_CTL_SET_CONC

Modifies the current concurrency level for this regulator. The new value is passed as the val argument. The request returns the old concurrency level on success.

A zero or negative value for val means automatic and is recomputed as the current number of online CPUs as sysconf(_SC_NPROCESSORS_ONLN) would return.
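The "automatic" rule can be mirrored in userland, for instance to predict which level a given val will resolve to. A small sketch, where `resolve_concurrency` is a hypothetical helper and the fallback of 1 when sysconf() fails is an assumption of this example:

```c
#include <assert.h>
#include <unistd.h>

/* Concurrency a PWQR_CTL_SET_CONC request of `val` would resolve to. */
int resolve_concurrency(int val)
{
    long ncpu;

    if (val > 0)
        return val;
    /* Zero or negative means automatic: use the online CPU count. */
    ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    return ncpu > 0 ? (int)ncpu : 1;
}
```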

PWQR_CTL_REGISTER

Registers the calling thread to be taken into account by the pool regulator. If the thread is already registered into another regulator, then it’s automatically unregistered from it.

PWQR_CTL_UNREGISTER

Deregisters the calling thread from the pool regulator.

PWQR_CTL_WAKE

Tries to wake val threads from the pool. This is done according to the current concurrency level, so as not to overcommit. On success, a hint of the number of woken threads is returned; it can be 0.

This is only a hint of the number of threads woken up, for two reasons. First, the kernel could really have woken up a thread, but by the time that thread gets scheduled, the kernel could decide that it would overcommit (because some other thread unblocked in between, for example) and block it again.

But it can also lie in the other direction: userland is supposed to account for waiting threads. So when we’re overcommitting and userland wants a waiting thread to be unblocked, we actually say we woke none, but still unblock one (the quarantined threads described above). This allows the userland counter of waiting threads to decrease, but we know the thread won’t be usable, so we return 0.

PWQR_CTL_WAKE_OC

Tries to wake val threads from the pool. This is done bypassing the current concurrency level (OC stands for OVERCOMMIT). On success, the number of woken threads is returned. It can be 0, but it’s the real count of threads that have been (or will soon be) woken up. If it’s less than requested, it’s because there aren’t enough parked threads.

PWQR_CTL_WAIT

Puts the thread to wait for a future PWQR_CTL_WAKE command. If this thread must be parked to maintain concurrency below the target, then the call blocks with no further ado.

If the concurrency level is below the target, then the kernel checks if the address addr still contains the value val (in the fashion of futex(2)). If it doesn’t then the call doesn’t block. Else the calling thread is blocked until a PWQR_CTL_WAKE command is received.

addr must of course be a pointer to an aligned integer which stores the reference ticket in userland.
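The decision made by PWQR_CTL_WAIT can be modelled as below. This is a sketch of the semantics as described here, not the kernel code: `model_wait` and its enum are hypothetical, and it assumes that a pool exactly at the target behaves like an overcommitting one (the text only covers the "below the target" case explicitly).

```c
#include <assert.h>
#include <stddef.h>

enum wait_outcome {
    WAIT_BLOCKS,        /* thread sleeps until it is woken */
    WAIT_WOULD_BLOCK    /* ticket changed: fails with EWOULDBLOCK */
};

/*
 * `active` is the current concurrency level and `target` the configured
 * one. The ticket at addr is only compared with val when the pool is
 * below target; otherwise the call blocks with no further ado.
 */
enum wait_outcome model_wait(int active, int target, const int *addr, int val)
{
    assert(addr != NULL);
    if (active >= target)
        return WAIT_BLOCKS;
    return *addr == val ? WAIT_BLOCKS : WAIT_WOULD_BLOCK;
}
```

As with futex(2), the comparison lets userland bump the ticket when it queues a job, so that a thread about to wait on a stale ticket returns immediately instead of sleeping.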

PWQR_CTL_PARK

Puts the thread in park mode. Those are spare threads to avoid cloning/exiting threads when the pool is regulated. Those threads are released by the regulator only, and can only be woken from userland with the PWQR_CTL_WAKE_OC command, and once all waiting threads have been woken.

The call blocks until an overcommitting wake requires the thread, or the kernel regulator needs to grow the pool with new running threads.

RETURN VALUE

When successful pwqr_ctl returns a nonnegative value. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

[EBADF]

pwqrfd is not a valid file descriptor.

[EBADFD]

pwqrfd is a valid pwqr file descriptor but is in a broken state: it has been closed while other threads were in a pwqr_ctl call.

Note
this is due to the current implementation and would probably not be here with a real syscall.

[EFAULT]

Error in reading value from addr from userspace.

[EINVAL]

TODO

Errors specific to PWQR_CTL_REGISTER:

[ENOMEM]

There was insufficient memory to perform the operation.

Errors specific to PWQR_CTL_WAIT:

[EWOULDBLOCK]

When the kernel checked whether addr still contained val, it did not. This works like futex(2).

Errors specific to PWQR_CTL_WAIT and PWQR_CTL_PARK:

[EINTR]

The call was interrupted by a signal (note that sometimes the kernel masks this fact when it has more important "errors" to report, like EDQUOT).

[EDQUOT]

The thread has been woken by a PWQR_CTL_WAKE or PWQR_CTL_WAKE_OC call, but is overcommitting.