There are a number of properties that are non-negotiable in any software system, such as no data loss, no stalling, and timely recovery from faults. These properties are particularly difficult to achieve in a distributed system, as the vast number of permutations of timings, event orderings, and faults makes reasoning about such systems extremely challenging.
If we’ve learned one thing developing Rama, it’s this: a distributed system is only as robust as the quality and thoroughness of its tests. More than that, there are certain testing techniques you must use if you’re going to have any hope of building a system resilient to the crazy edge cases that inevitably happen in production. If a distributed system isn’t tested with these techniques, you shouldn’t trust it.
Because Rama is such a general-purpose system, capable of handling all the computation and storage needs for entire backends at scale, it has a particularly large testing surface. First, there’s the core functionality of Rama to test: distributed logs (depots), ETL topologies (streaming and microbatching), indexes (PStates), and queries. Then there are the “module transition” operations for updating or scaling running Rama applications. Finally, there’s replication. Replication needs to be tested both when healthy, with all replicas keeping up, and when replicas fall behind and need to catch up. Compounding the testing surface is that all of these need to be tested under the conditions of any number of faults happening at any time, such as disk errors, nodes losing power, and network partitions.
Testing is largely a sampling problem. Each sample exercises the system in a particular state, with input data of some size and shape, at some amount of load, and with some set of faults at some frequency. A testing strategy needs to sample this input space in a representative way. In a highly concurrent distributed system, where there are so many ways that events can be randomly ordered across different threads, achieving a representative sample is difficult. And if something isn’t tested, it’s either broken or will be broken in the near future.
Another huge issue in testing is reproducibility. A test that’s hard to reproduce is very expensive, or possibly even impossible, to debug. Test failures in concurrent systems are frequently difficult to reproduce due to the inherent randomness of their execution. As you’ll see shortly, this is another major focus of our testing strategy for Rama.
To fully cover the testing surface of Rama, we use multiple strategies. The first line of defense is our unit tests, which we run twice every hour as well as with every commit. In addition to that, we do automated chaos testing 24/7 on multiple distributed clusters. The last major piece is load testing, where we run large clusters for long periods of time with large amounts of data. There’s a lot of detail to how we approach each of these, which I’ll cover in the subsequent sections.
The straightforward but wrong way to unit test
The straightforward way to unit test is to run the system much as it would run in production, with many threads and lots of concurrency. In this context, instead of running the system across many processes and many nodes, you simulate the various processes comprising the system all within the same testing process. Intuitively, running a system as close as possible to how it operates in production seems like a good way to test. However, this approach has massive flaws that make it a poor foundation for unit testing a distributed system.
In Rama we implemented this approach in a testing tool called “in-process cluster” (IPC). For basic things, like verifying how PStates are updated in a module, this approach works fine. Failures in tests like these are fairly easy to track down. However, when testing more complex scenarios like leadership switches during processing, random worker deaths, or far-horizon catchup, debugging can get extremely difficult.
When developing tests for scenarios like those using IPC, we frequently ran into situations where a test failure would only reproduce once in a hundred test runs. Sometimes it was even rarer, like once in a thousand runs. To debug the test, we would add logging and then run the test over and over until it reproduced. Unfortunately, in IPC the very act of adding logging changes the timings of execution and can make reproduction less likely. It may make reproduction so unlikely that the failure becomes basically impossible to trigger again. So then you have to get very creative to find ways to get more information without changing the timing so much that it no longer reproduces. This cycle of adding logging and reproducing is very time-consuming. Sometimes a single bug would take us weeks to figure out.
The expense of debugging isn’t the worst issue of IPC though. The worst issue is how difficult it is to thoroughly explore the testing space. The vast majority of issues that we’ve debugged in Rama have had to do with ordering of events. Many bugs can be triggered by one particular thread getting randomly stalled for an unusual amount of time (e.g. from GC). Other bugs can come from rare orderings of events on a single thread. Exploring these scenarios in IPC requires using semaphores to try to coordinate the threads in question into these unusual permutations, or just hoping that random timings during execution sometimes explore these orderings. Exploring internal timeouts and the consequences of those requires even more creativity.
Fortunately, we found a better way.
Deterministic simulation
Inspired by what the FoundationDB team wrote about simulation testing, we invested heavily into implementing “deterministic simulation” for Rama. Deterministic simulation is counterintuitive, but it is by far the best way to unit test a distributed system.
The idea of deterministic simulation is to run the whole system on a single thread. What would normally be its own thread in production is instead an executor explicitly managed by a central controller. The controller runs in a loop choosing which executor gets to run an event each iteration. A random number generator is used to choose executors, and a test is fully reproducible by just using the same seed for the RNG. Tests are easily debugged since adding logging does not change the reproducibility of the test like it frequently does for IPC.
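To make the idea concrete, here’s a minimal sketch of such a controller loop in Clojure. This is not Rama’s actual implementation; make-executor, enqueue-event!, and run-simulation! are names invented for illustration. Each executor is a queue of pending events, and a seeded RNG picks which executor runs next, so the seed fully determines the ordering.

;; Minimal sketch of a deterministic scheduler, not Rama's internals.
;; Each "executor" is an atom holding a queue of pending events (no-arg
;; functions). Given the same seed and the same initial events, every
;; run executes events in exactly the same order.
(defn make-executor []
  (atom clojure.lang.PersistentQueue/EMPTY))

(defn enqueue-event! [executor f]
  (swap! executor conj f))

(defn run-simulation! [seed executors]
  (let [rng (java.util.Random. seed)]
    (loop []
      (let [ready (vec (filter #(seq @%) executors))]
        (when (seq ready)
          (let [executor (nth ready (.nextInt rng (count ready)))
                event    (peek @executor)]
            (swap! executor pop)
            (event) ;; an event may enqueue further events on any executor
            (recur)))))))

Running this loop twice with the same seed replays the identical ordering, which is why recording a failing run’s seed is enough to reproduce it exactly.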
Deterministic simulation removes all concurrency from the execution of Rama during tests. This seems like it would be a bad thing, making the unit test environment fundamentally different from production. However, since in our experience the vast majority of issues come down to event ordering and timing, the opposite is true. Deterministic simulation is incredible, almost magical, for diagnosing and debugging these issues. Deterministic simulation isn’t sufficient as a complete testing strategy, as you still need tests that exercise potential concurrency issues, but it is overwhelmingly better for most tests.
Implementing deterministic simulation required us to refactor the Rama implementation to be able to run in either “production mode”, using many threads, or “simulation mode”, using a single thread. We also needed to abstract the notion of time. In simulation, all entities interact with a time source that’s managed by the central controller of the simulation. Usually, we advance time by a small amount per iteration, but sometimes we use other strategies. For example, tests specifically about timeouts usually disable automatic time advancement and advance time explicitly. They may advance time to one millisecond before the timeout, verify nothing happens, and then verify the timeout behavior is triggered when advancing by one additional millisecond.
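As a rough illustration of the time abstraction (the names here are hypothetical, not Rama’s API), timeout logic reads from a time source the controller owns instead of the system clock, so a test can step time right up to and across a timeout boundary:

;; Hypothetical sketch of a controllable time source; Rama's real
;; abstraction differs, but the principle is the same.
(defprotocol TimeSource
  (now-millis [this]))

(defrecord SimulatedTime [current]
  TimeSource
  (now-millis [_] @current))

(defn advance-sim-time! [time-source millis]
  (swap! (:current time-source) + millis))

(defn timed-out? [time-source deadline-millis]
  (>= (now-millis time-source) deadline-millis))

;; Step to one millisecond before a 10-second timeout, verify nothing
;; fires, then cross the boundary and verify the timeout triggers:
(let [t (->SimulatedTime (atom 0))]
  (advance-sim-time! t 9999)
  (assert (not (timed-out? t 10000)))
  (advance-sim-time! t 1)
  (assert (timed-out? t 10000)))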
Here’s pseudocode showing what a simulation test looks like:
(with-simulated-cluster [cluster]
  (let [module (get-module-definition-for-test)]
    (cluster-launch-module! cluster
                            module
                            {:tasks 4
                             :threads 2
                             :workers 2})
    (execute-until! cluster
      (= [RUNNING] (module-state cluster)))
    (execute-until! cluster
      (with-context (task-context cluster module 0)
        (some-predicate-using-task-local-data?)))
    (with-simulation-options {:time-strategy nil}
      (execute-until! cluster (another-predicate?)))
    (is (yet-another-predicate?))
    (advance-time! cluster 10000 :millis)
    (with-simulation-options {:time-strategy nil}
      (execute-until! cluster (different-predicate?)))
    ))
In a simulation test, nothing happens unless you explicitly tell it to execute. So in between any of those execute-until! calls, the simulation is frozen. While frozen, we can peer into the state of any of the executors in the simulation and assert on it. Those execute-until! calls run the simulation until the precise event that causes a condition to become true. We can then assert on the entire state of the system at that point, across what in production would be many threads across many processes. From the :time-strategy options and usage of advance-time!, you can see how we’re able to explicitly control time in the simulation.
Besides reproducibility, deterministic simulation hugely improves our ability to explore the testing space of Rama. We are no longer beholden to random timings, nor do we need burdensome coordination with semaphores, to explore unusual event orderings. We can suspend a particular executor to simulate a GC pause. We can use a non-uniform distribution to prefer certain executors over others. We can insert faults or other disturbances at extremely precise moments.
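In terms of the scheduler sketch above, these controls amount to small changes in how the next executor gets picked. The weighting scheme below is illustrative only, not how Rama actually biases its simulations:

;; Illustrative extension of the scheduler sketch. `suspended` is a set of
;; executors to skip entirely, which looks to the rest of the system like a
;; long GC pause on that "thread"; `weights` maps executor -> relative
;; weight (default 1) to skew the choice toward certain executors.
(defn choose-executor [^java.util.Random rng suspended weights ready]
  (let [candidates (vec (remove suspended ready))]
    (when (seq candidates)
      (let [total  (reduce + (map #(get weights % 1) candidates))
            target (* (.nextDouble rng) total)]
        ;; walk the candidates, picking the one whose cumulative weight
        ;; crosses the randomly chosen target
        (loop [[c & cs] candidates
               acc      0.0]
          (let [acc (+ acc (get weights c 1))]
            (if (or (>= acc target) (empty? cs))
              c
              (recur cs acc))))))))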
Each run of a simulated test is a sample of the testing space, so it’s critical to continually run all tests all the time. Each seed explores a different permutation of events. We run our full build twice an hour as well as with each commit, so we get at least 48 samples of each test every day.
Uncoordinated simulation tests
Many of our simulation tests check specific behaviors, like “when a microbatch fails, the work is retried from the previous commit”. A different category of simulation tests are what we call “uncoordinated simulation tests”.
In an uncoordinated simulation test, we create many entities that interact with a Rama cluster with one or more modules deployed on it. Some of those entities may be appending new data. Others may be querying PStates. Others may be performing module operations like module update or scaling. Then there may be some antagonistic entities disturbing the cluster, like creating random disk faults, killing workers at random, or suspending executors at random. Each entity has a rate at which it performs these actions, such as “one depot append every 100 milliseconds of simulated time”. Most importantly, the entities are not coordinated with one another and take actions independently. They’re registered as executors in the simulation and are given the opportunity to run events when chosen by the central controller. In tests like these, we verify high-level properties like no data loss, no stalling, data being processed in the correct order, and all replicas eventually catching up to become part of the ISR.
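In terms of the earlier scheduler sketch, such an entity can be modeled as just another executor whose recurring event performs one action and then waits for enough simulated time to pass before the next one. The helper below is an illustrative sketch, not Rama’s test harness, and do-depot-append! and kill-random-worker! are hypothetical actions:

;; Illustrative sketch reusing enqueue-event! and now-millis from the
;; sketches above. The entity polls the simulated clock, which the
;; controller advances between iterations, so its actions happen at
;; roughly the requested rate in simulated time while being interleaved
;; arbitrarily with everything else in the simulation.
(defn register-entity! [executor time-source period-millis action!]
  (letfn [(tick [last-run]
            (let [now (now-millis time-source)]
              (if (>= (- now last-run) period-millis)
                (do (action!)
                    (enqueue-event! executor #(tick now)))
                (enqueue-event! executor #(tick last-run)))))]
    (enqueue-event! executor #(tick (now-millis time-source)))))

;; e.g. one depot append every 100ms of simulated time, plus an antagonist
;; killing a random worker every 5 simulated seconds:
;; (register-entity! appender-exec sim-time 100  #(do-depot-append! cluster))
;; (register-entity! chaos-exec    sim-time 5000 #(kill-random-worker! cluster))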
Uncoordinated simulation tests are particularly good at finding race conditions. The randomness and lack of coordination causes runs of the test to eventually explore all possible race conditions. And since the test is fully reproducible, we can easily track down the cause of any failures no matter how obscure the event ordering.
Chaos testing on real clusters
As mentioned, deterministic simulation is only part of our testing strategy. As powerful as it is, there are a few critical things it doesn’t cover:
- True concurrency issues, like improper concurrent access to shared memory
- Memory or other resource leaks
- Running the cluster as separate processes on separate nodes
To cover these gaps, another critical piece is automated testing on a few distributed clusters that we run 24/7. We call these our “quality clusters”.
Each cluster has a dedicated process controlling it. This central process interacts with the cluster by appending new data, doing queries, and performing a variety of disturbances. The disturbances include random worker kills, full cluster restarts, network partitions, and suspending nodes to force them into far-horizon catchup. The central process performs high-level assertions about no data loss, cluster integrity, nothing being stalled, and nothing unexpected in any log file.
There’s tension between disturbing the cluster, which often involves killing or restarting processes, and wanting to test long-lived processes to verify there are no memory leaks. So the central process chooses one task group, and any workers containing replicas of that task group, to be protected for the lifetime of the cluster. Any routines that cause process death never pick those workers. This enables the cluster to test long-lived processes as well as the full range of disturbances.
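A rough sketch of that selection logic, with hypothetical names standing in for the real controller code:

;; Hypothetical sketch: protected-workers is chosen once at cluster start
;; (the workers holding replicas of one chosen task group), and any
;; disturbance that causes process death only ever targets the rest.
(defn pick-kill-target [^java.util.Random rng all-workers protected-workers]
  (let [candidates (vec (remove protected-workers all-workers))]
    (when (seq candidates)
      (nth candidates (.nextInt rng (count candidates))))))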
We have three quality clusters, called “nightly”, “disturbances-monthly”, and “module-operations”. The “nightly” cluster is replaced every day with the latest Rama commit and is meant to catch recent regressions quickly.
The “disturbances-monthly” cluster is replaced every month with the latest commit. By the end of the run, if all went well, the longest-lived process will have been running for a month. This cluster deploys many modules exercising the full feature-set of Rama, and it also runs a special module called “LoadModule” which creates heavy load on all the other modules. “disturbances-monthly” performs the full range of disturbances. It does not do any module operations like module update or scaling, however, since this cluster tests long-lived processes and those routines replace every process for a module.
The “module-operations” cluster runs the same set of modules as “disturbances-monthly” and is dedicated to exercising module update and scaling. It also performs disturbances during these module operations to verify their fault-tolerance. A module operation should always either succeed completely or abort. An abort can be due to there being too much chaos on the cluster, like such frequent worker kills that the operation can’t go through in a reasonable amount of time. “module-operations” verifies these operations never stall under any conditions and that there’s never any data loss.
Issues surfaced by these clusters are much harder to debug as they are frequently difficult to reproduce. If the cause of an issue isn’t obvious, our first line of attack is to try to reproduce the issue in a simulation test. If it reproduces in simulation, then it’s easy to debug from there. This has proven to be effective for most issues surfaced, and then that simulation test serves as a regression test moving forward.
Load testing
Another type of testing we do is load testing. The goal of load testing is to test bigger clusters with bigger individual partitions than we achieve in our month-long quality clusters. When everything is bigger, things like timeouts and stall detection are better exercised.
Load testing is expensive, requiring us to run hundreds of nodes for long periods of time. So we currently do load tests as one-offs rather than continuously (e.g. ahead of a release). For example, leading up to the launch of our Twitter-scale Mastodon instance in August, we did thorough load testing for long periods of time.
We’re currently working on implementing incremental backups for Rama. Once that’s done, we’re planning to use that in our quality clusters to start off each new cluster with the data that accumulated on the last cluster. This will enable our quality testing to operate with large individual partitions, bringing that one aspect of load testing under continuous test.
Conclusion
We take testing very seriously at Red Planet Labs. We spend more than 50% of our time on testing, and we believe any software team building infrastructure must do the same to be able to deliver a robust tool.
We’re constantly working on improving our test coverage, like by incorporating new kinds of faults in our simulations and quality clusters. An idea we had recently is “long-running simulation tests”, where we run simulations for days or weeks at a time. This strategy has an interesting tradeoff: any failure will be reproducible, but a failure that happens after two weeks of running will still take two weeks to reproduce. Even so, these simulations may uncover issues that only occur after extended runtimes, and being able to reproduce a potentially obscure issue is still very valuable.
Software cannot be understood purely in the abstract. It requires empirical evidence to know how it behaves in the strenuous conditions it will face in real-world deployments. A major reason it took us 10 years to build Rama was going through that process of testing, iterating, and testing some more until we were confident Rama was ready for production use.