Monday, March 14, 2016

On property-based testing a highly concurrent job queue

Exploring Eris
Recruiter is a job queue written in PHP, open sourced by Onebip in its 2.x version. It has been used on some inward facing production services, but following the necessity to roll it out to more and more project, I have started a thorough testing campaigns to flush out possible concurrency bugs.
The job system is composed of a single Recruiter process and multiple Worker processes (moreover any other PHP process can enqueue a job). These processes may run on any machine inside a local network and share a MongoDB database where they collaborate to empty the collection of jobs to do. The design of these multiple collections is carefully tuned for scalability, as in its first version Recruiter used heavier findAndModify operations which is now free of.

Property-based testing

Testing some happy paths such as adding a job and executing it is fine for test-driving the code, but it's nowhere near enough for quality assurance. On system of any appreciable scale and/or quality, testing is a separate additional activity (that can hopefully be performed by developers wearing a different hat, or in any case inside a single cross-functional team.)
To test highly concurrent processes such as a recruiter and its dozens of workers insisting on the same database, we adopted Eris, the open source PHP QuickCheck implementation developed by me and some colleagues. Eris is able to generate random inputs for the System Under Test, according to a specification provided by the tester; it supports property-based testing which drives the system with this input while checking important properties are respected.
In this scenario, we generated a random sequence of actions to perform over these processes, checking invariants and post-conditions of operations. For example, an invariant is there is never more than one recruiter process alive. There are surprisingly few invariants when you work with distributed systems; as another example consider the number of workers registered in the related MongoDB collection. This number is not fixed, as crashed processes may still be present even if dead, as long as the rest of the system didn't detect the crash yet.
One postcondition of the job system is very important: any job enqueued is eventually executed, preferably as soon as possible. In these tests, we focused on testing the correctness of this property and not the performance. We monitor the collection of archived jobs (which have been executed correctly) and check that it fills up with all the jobs we expect. The timeout after which we declare the test failed is tuned to the total number of actions performed, which is random.
There are more advanced approaches such as generating a sequential prefix plus a few parallel sequences of actions. This could give more control over the process and may enable some form of shrinking with better determinism; however we retain a notion of parallelism by creating multiple processes. Unfortunately each run is non-deterministic as processes and the underlying MongoDB instance can be scheduled differently by the operating system, changing the interleave of their operation; therefore shrinking is not possible, or is possible only at the cost of running shrunk sequences multiple times to reliably mark them as passing.

Iteration: random number of jobs, graceful restarts

In the first version of the test, we generated a random number of identical jobs (executing a "echo 42" command), along with a series of restarts of the recruiter and a single worker process using SIGTERM. The jobs were enqueued serially by the test process, along with the restart actions. In theory, the processes intercept the signals and exit after having finished their current cycle of polling or execution.
Here are the bugs that we found:

Iteration: multiple workers

Once the test suite was consistently green, we extended the testing environment by allowing multiple workers to be created and correctly restarted.
We found an additional problem with this extension:

Iteration: crashing workers

We added the possibility of killing a worker with SIGKILL, immediately interrupting it even in the middle of database updates.
The possibility of a worker crashing was already covered by the code. However, we tuned the timeout period after which workers are considered dead while inside the test suite; we set it to dozens of seconds instead of half an hour to allow for sane waiting periods in the test process.

Iteration: crashing the recruiter

Killing the single recruiter process was interesting because it usually takes a lock (in the form of a document inside a MongoDB collection with a unique index) to avoid accidental multiple executions. The process correctly waited on the previous lock to expire before restarting, but...

Iteration: length of jobs

We introduced also a random length for enqueued jobs (sleeping from 0ms to 1000ms instead of executing a fixed command). At this point we did not find additional bugs at the time of this post, with the test suite running for several hours, exploring new random possible sequences of actions.

Final version

The final version of the test composes an Eris Generator that:
  • generates a number of workers to start between 1 and 4.
  • using this number, creates a new Generator that produces a tuple (in this case a pair, which means an array of two elements of disparate types). The tuple contains the number of workers itself and the list of actions.
The list of actions is a sequence of a random number of elements, where each of the elements can in turn be an action representing:
  • a job to enqueue with an expected duration of a positive number of milliseconds
  • a graceful restart of one of the workers
  • a graceful restart of the recruiter
  • a kill -9 on one of the worker processes
  • a kill -9 on the recruiter process
  • a sleep of a number of milliseconds between 0 and 1000
Here is an example of action sequence:

[ACTIONS][PHPUNIT][2016-03-14T12:11:14+01:00] ["enqueueJob",8]
[ACTIONS][PHPUNIT][2016-03-14T12:11:14+01:00] "restartRecruiterGracefully"


While here is a moderately complex example:

[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:11:59+01:00] "restartRecruiterGracefully"
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["enqueueJob",7]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["enqueueJob",13]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] "restartRecruiterByKilling"
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerGracefully",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:00+01:00] "restartRecruiterGracefully"
[ACTIONS][PHPUNIT][2016-03-14T12:12:10+01:00] ["sleep",860]
[ACTIONS][PHPUNIT][2016-03-14T12:12:11+01:00] "restartRecruiterByKilling"
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["enqueueJob",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["restartWorkerByKilling",0]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] ["enqueueJob",5]
[ACTIONS][PHPUNIT][2016-03-14T12:12:12+01:00] "restartRecruiterGracefully"

The parameter in the steps modelled as arrays is the duration of a job, or the number of the worker in case of restarting actions.

The test generates 100 of these sequences (this number is tunable, or can target a time limit). For each of them it creates an empty database, starts the workers and the recruiter, performs the actions and waits for all jobs to be performed. If the timeout for full execution expires, the test is marked as failed and lists the log files to look at to understand what happened. On my machine, the test now terminates in about one hour, with a green bar.

Conclusions

Testing is an important activity and can increase the quality of your software by removing bugs before they can get to one of your customers. Testing is becoming more and more incorporated in the lifes of developers (see Test-Driven Development and Behavior-Driven Development), but for core domains and infrastructure additional activities are required for stress and performance tests comparable to production traffic.
It is however impossible to write by hand tests for all the possible situations; however you can easily build a reasonable model of the input to your system. So let me quote John Hughes in saying "Don't write tests. Generate them"; with property-based testing you can write one test containing one property, and catch dozens of bugs like in this post's case study.

No comments:

Post a Comment