Invisible to the eye: On property-based testing a highly concurrent job queue

Exploring Eris

Recruiter is a job queue written in PHP, open sourced by Onebip in its 2.x version. It has been used on some inward facing production services, but following the necessity to roll it out to more and more project, I have started a thorough testing campaigns to flush out possible concurrency bugs.
The job system is composed of a single Recruiter process and multiple Worker processes (moreover any other PHP process can enqueue a job). These processes may run on any machine inside a local network and share a MongoDB database where they collaborate to empty the collection of jobs to do. The design of these multiple collections is carefully tuned for scalability, as in its first version Recruiter used heavier findAndModify operations which is now free of.

Property-based testing

Testing some happy paths such as adding a job and executing it is fine for test-driving the code, but it's nowhere near enough for quality assurance. On system of any appreciable scale and/or quality, testing is a separate additional activity (that can hopefully be performed by developers wearing a different hat, or in any case inside a single cross-functional team.)
To test highly concurrent processes such as a recruiter and its dozens of workers insisting on the same database, we adopted Eris, the open source PHP QuickCheck implementation developed by me and some colleagues. Eris is able to generate random inputs for the System Under Test, according to a specification provided by the tester; it supports property-based testing which drives the system with this input while checking important properties are respected.
In this scenario, we generated a random sequence of actions to perform over these processes, checking invariants and post-conditions of operations. For example, an invariant is there is never more than one recruiter process alive. There are surprisingly few invariants when you work with distributed systems; as another example consider the number of workers registered in the related MongoDB collection. This number is not fixed, as crashed processes may still be present even if dead, as long as the rest of the system didn't detect the crash yet.
One postcondition of the job system is very important: any job enqueued is eventually executed, preferably as soon as possible. In these tests, we focused on testing the correctness of this property and not the performance. We monitor the collection of archived jobs (which have been executed correctly) and check that it fills up with all the jobs we expect. The timeout after which we declare the test failed is tuned to the total number of actions performed, which is random.
There are more advanced approaches such as generating a sequential prefix plus a few parallel sequences of actions. This could give more control over the process and may enable some form of shrinking with better determinism; however we retain a notion of parallelism by creating multiple processes. Unfortunately each run is non-deterministic as processes and the underlying MongoDB instance can be scheduled differently by the operating system, changing the interleave of their operation; therefore shrinking is not possible, or is possible only at the cost of running shrunk sequences multiple times to reliably mark them as passing.

Iteration: random number of jobs, graceful restarts

In the first version of the test, we generated a random number of identical jobs (executing a "echo 42" command), along with a series of restarts of the recruiter and a single worker process using SIGTERM. The jobs were enqueued serially by the test process, along with the restart actions. In theory, the processes intercept the signals and exit after having finished their current cycle of polling or execution.
Here are the bugs that we found:

Jobs locked but not assigned. Since two different collections are at play for scalability reasons, one of jobs and one of workers, a worker may terminate after it is selected for a job but before it is written into its collection. https://github.com/onebip/recruiter/commit/ef490f4acde1b7d00d9b40aee5b2f02257f67d70
Worker timeout. Its polling should stay at the maximum frequency when there are jobs being executed and only back off when no jobs are found. https://github.com/onebip/recruiter/commit/be13df0d818536bce45050495be1183d109a504d
Recruiter high latency (>45s) in edge cases caused by the maximum back off increasing too much. This is not strictly a correctness problem but solving it improved the reliability of tests, which are timeout-based. https://github.com/onebip/recruiter/commit/8bb11f5c243ebce834eaeadbc1c5c98815581b1d
Active Record overwriting the worker document inside its collection when polling, possibly losing the update of the last assigned job. https://github.com/onebip/recruiter/commit/0c950595a475f36778641f49c8b369135a329f8a
Signals should be handled before interacting with the database, so that graceful shutdowns can be performed even in the startup phases of the recruiter and worker processes (same commit as before). https://github.com/onebip/recruiter/commit/0c950595a475f36778641f49c8b369135a329f8a
Worker retiring before executing an assigned job. https://github.com/onebip/recruiter/commit/8b23399b567db23d368c4622063eb80db8e8666a
Error handling code that was never executed, referencing a not existent class name. https://github.com/onebip/recruiter/commit/e8f85ddea7c574991a597b8a386e1ebfe72666d6

Iteration: multiple workers

Once the test suite was consistently green, we extended the testing environment by allowing multiple workers to be created and correctly restarted.
We found an additional problem with this extension:

Assigned jobs must be a numeric array. When passing an array of job assignments it can get the form [0 => ..., 1 => ..., 3 => ...] and the missing index (coming from a deletion) makes it invalid for a MongoDB query, which fails. https://github.com/onebip/recruiter/commit/2b9af50985542caddb3395a80ad0131c0f666bec

Iteration: crashing workers

We added the possibility of killing a worker with SIGKILL, immediately interrupting it even in the middle of database updates.
The possibility of a worker crashing was already covered by the code. However, we tuned the timeout period after which workers are considered dead while inside the test suite; we set it to dozens of seconds instead of half an hour to allow for sane waiting periods in the test process.

Iteration: crashing the recruiter

Killing the single recruiter process was interesting because it usually takes a lock (in the form of a document inside a MongoDB collection with a unique index) to avoid accidental multiple executions. The process correctly waited on the previous lock to expire before restarting, but...

An off-by-one error bug in the related MongoLock class. A lock could be waited upon but then failing on the acquisition attempt without any other process try to steal it. https://github.com/onebip/onebip-concurrency/commit/d7af1a37a5ae2c17e77a9545006c6abeb8cf95e3

Iteration: length of jobs

We introduced also a random length for enqueued jobs (sleeping from 0ms to 1000ms instead of executing a fixed command). At this point we did not find additional bugs at the time of this post, with the test suite running for several hours, exploring new random possible sequences of actions.

Final version

The final version of the test composes an Eris Generator that:

generates a number of workers to start between 1 and 4.
using this number, creates a new Generator that produces a tuple (in this case a pair, which means an array of two elements of disparate types). The tuple contains the number of workers itself and the list of actions.

The list of actions is a sequence of a random number of elements, where each of the elements can in turn be an action representing:

a job to enqueue with an expected duration of a positive number of milliseconds
a graceful restart of one of the workers
a graceful restart of the recruiter
a kill -9 on one of the worker processes
a kill -9 on the recruiter process
a sleep of a number of milliseconds between 0 and 1000

Here is an example of action sequence:

[ACTIONS][PHPUNIT][2016-03-14T12:11:14+01:00] ["enqueueJob",8]
[ACTIONS][PHPUNIT][2016-03-14T12:11:14+01:00] "restartRecruiterGracefully"

While here is a moderately complex example:

[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] "restartRecruiterGracefully"
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["enqueueJob",7]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["enqueueJob",13]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] "restartRecruiterByKilling"
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerGracefully",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] "restartRecruiterGracefully"
[ACTIONS][PHPUNIT][2016-03-14T12:12:10+01:00] ["sleep",860]
[ACTIONS][PHPUNIT][2016-03-14T12:12:11+01:00] "restartRecruiterByKilling"
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["enqueueJob",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["enqueueJob",5]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] "restartRecruiterGracefully"
The parameter in the steps modelled as arrays is the duration of a job, or the number of the worker in case of restarting actions.

The test generates 100 of these sequences (this number is tunable, or can target a time limit). For each of them it creates an empty database, starts the workers and the recruiter, performs the actions and waits for all jobs to be performed. If the timeout for full execution expires, the test is marked as failed and lists the log files to look at to understand what happened. On my machine, the test now terminates in about one hour, with a green bar.

Conclusions

Testing is an important activity and can increase the quality of your software by removing bugs before they can get to one of your customers. Testing is becoming more and more incorporated in the lifes of developers (see Test-Driven Development and Behavior-Driven Development), but for core domains and infrastructure additional activities are required for stress and performance tests comparable to production traffic.
It is however impossible to write by hand tests for all the possible situations; however you can easily build a reasonable model of the input to your system. So let me quote John Hughes in saying "Don't write tests. Generate them"; with property-based testing you can write one test containing one property, and catch dozens of bugs like in this post's case study.

Invisible to the eye

Monday, March 14, 2016

On property-based testing a highly concurrent job queue