2.5x better performance: Rama vs. MongoDB and Cassandra

We ran a number of benchmarks comparing Rama against the latest stable versions of MongoDB and Cassandra. The code for these benchmarks is available on GitHub. Rama’s indexes (called PStates) can reproduce any database’s data model, since each PState is an arbitrary combination of durable data structures of any size. We chose to do our initial benchmarks against MongoDB and Cassandra because they’re widely used and, like Rama, horizontally scalable. In the future we’ll also benchmark against databases with other data models.

There are some critical differences between these systems that are important to keep in mind when looking at these benchmarks. In particular, Cassandra by default does not guarantee that writes are durable when it acknowledges them as successful. Its commitlog_sync config specifies the strategy it uses to sync its commit log to disk. The default setting, “periodic”, does the sync every 10 seconds. This means Cassandra can lose up to 10 seconds of acknowledged writes and regress reads on those keys (we disagree strongly with this setting being the default, but that’s a post for another day).

Rama has extremely strong ACID properties. An acknowledged write is guaranteed to be durable on the leader and all in-sync followers. This is an enormous difference with Cassandra’s default settings. As you’ll see, Rama beats or comes close to Cassandra in every benchmark. You’ll also see we benchmarked Cassandra with a commitlog_sync setting that does guarantee durability, but that causes its performance to plummet far below Rama.

MongoDB, at least in the latest version, also provides a durability guarantee by default. We benchmarked MongoDB with this default setting. Rama significantly outperforms MongoDB in every benchmark.

Another huge difference between Rama and MongoDB/Cassandra (and pretty much every database) comes from Rama being a much more general-purpose system. Rama explicitly distinguishes data from indexes and stores them separately. Data is stored in durable, partitioned logs called “depots”. Depots are a distinct concept from “commit logs”, a separate mechanism that MongoDB, Cassandra, and Rama all have as part of their implementations. When using Rama, you code “topologies” that materialize any number of indexes of any shape from depots. You can use depots to recompute indexes if you made a mistake, or to materialize entirely new indexes in the future to support new features. Depots can be consumed by multiple topologies materializing multiple indexes of different shapes. So not only is Rama materializing indexes equivalent to MongoDB’s and Cassandra’s in these benchmarks with comparable or better performance, it’s also materializing a durable log. That’s a non-trivial amount of additional work, and we weren’t expecting Rama to perform so strongly compared to databases that aren’t doing it.
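To make that concrete, here’s a minimal sketch, in the style of the module definitions shown later in this post, of a single depot feeding two topologies that materialize differently shaped indexes (all names here are hypothetical, and details like the event schema are elided):

(defmodule ExampleModule [setup topologies]
  (declare-depot setup *event-depot :random)

  ;; one topology indexes each event by user
  (let [s (stream-topology topologies "by-user")]
    (declare-pstate s $$by-user {String java.util.Map})
    (<<sources s
      (source> *event-depot :> {:keys [*user-id] :as *event})
      (|hash *user-id)
      (local-transform> [(keypath *user-id) (termval *event)] $$by-user)))

  ;; a second topology consumes the same depot to count events per day
  (let [s (stream-topology topologies "daily-counts")]
    (declare-pstate s $$daily-counts {String Long})
    (<<sources s
      (source> *event-depot :> {:keys [*day]})
      (|hash *day)
      (+compound $$daily-counts {*day (aggs/+count)}))))

If the count index later turned out to be the wrong shape for a new feature, the depot could be replayed to materialize a new index without touching the existing ones.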

Benchmark setup

All benchmarks were done on a single m6gd.large instance on AWS. We used this instance type rather than m6g.large so we could use a local SSD to avoid complications with IOPS limits when using EBS.

We’re just testing single-node performance in this benchmark. We may repeat these tests with clusters of varying sizes in the future, including with replication. However, all three systems have already demonstrated linear scalability, so for this set of benchmarks we’re most interested in raw single-node performance.

For all three systems we only tested the primary index; we did not include secondary indexes in these tests. We tried configuring Cassandra with the same heap size as Rama’s worker (4GB) instead of the default 2GB it was choosing, but that actually made its read performance drastically worse. So we let it choose its own memory settings.

The table definition used for Cassandra was:

CREATE TABLE IF NOT EXISTS test.test (
  pk text,
  ck text,
  value text,
  PRIMARY KEY (pk, ck)
);

This is representative of the kind of indexing that Cassandra can handle efficiently, like range queries on a clustering key (e.g. "SELECT value FROM test.test WHERE pk = ? AND ck >= ? AND ck < ?;").

All Cassandra reads/writes were done with the prepared statements "SELECT value FROM test.test WHERE pk = ? AND ck = ?;" and "INSERT INTO test.test (pk, ck, value) VALUES (?, ?, ?);".

Cassandra was tested with both the “periodic” commitlog_sync config, which does not guarantee durability of writes, and the “batch” commitlog_sync config, which does guarantee durability of writes. We played with different values of commitlog_sync_batch_window_in_ms, but that had no effect on performance. We also tried the “group” commitlog_sync config, but we couldn’t get its throughput to be higher than “batch” mode. We tried many permutations of the configs commitlog_sync_group_window (e.g. 1ms, 10ms, 20ms, 100ms) and concurrent_writes (e.g. 32, 64, 128, 256), but the highest we could get the throughput was about 90% that of batch mode. The other suggestions on the Cassandra mailing list didn’t help.

The Rama PState equivalent to this Cassandra table had this data structure schema:

{[String, String] -> String}

The module definition was:

(defmodule CassandraModule [setup topologies]
  (declare-depot setup *insert-depot :random)

  (let [s (stream-topology topologies "cassandra")]
    (declare-pstate s $$primary {java.util.List String})
    (<<sources s
      (source> *insert-depot :> *data)
      (ops/explode *data :> [*pk *ck *val])
      (|hash *pk)
      (local-transform> [(keypath [*pk *ck]) (termval *val)] $$primary)
      )))

This receives triples of partitioning key, clustering key, and value and writes them into the PState, ensuring the data is partitioned by the partitioning key.

Cassandra and Rama both index using LSM trees, which sort on disk by key. Defining the key as a pair like this is equivalent to Cassandra’s “partitioning key” and “clustering key” definition, since keys are sorted first by the first element and then by the second. This means the same kinds of efficient point queries and range queries can be done.
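For example, here’s roughly what a point read and a range read against this PState look like from a client using Rama’s Clojure API. This is a sketch: we’re assuming a client PState handle bound to primary-pstate, and treating the :pkey option (which routes the query to the partition for that partitioning key) as our reading of the client API.

;; point read: the value for an exact [pk ck] key
(foreign-select-one (keypath [pk ck]) primary-pstate {:pkey pk})

;; range read: entries from [pk ck-start] (inclusive) to [pk ck-end] (exclusive),
;; relying on keys being sorted first by pk and then by ck
(foreign-select (sorted-map-range [pk ck-start] [pk ck-end]) primary-pstate {:pkey pk})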

The Rama PState equivalent to MongoDB’s index had this data structure schema:

{String -> Map}

The module definition was:

(defmodule MongoModule [setup topologies]
  (declare-depot setup *insert-depot :random)

  (let [s (stream-topology topologies "mongo")]
    (declare-pstate s $$primary {String java.util.Map})
    (<<sources s
      (source> *insert-depot :> *data)
      (ops/explode *data :> {:keys [*_id] :as *m})
      (|hash *_id)
      (local-transform> [(keypath *_id) (termval *m)] $$primary)
      )))

This receives maps containing an :_id field and writes each map to the $$primary index under that ID, keeping the data partitioned based on the ID.

We used strings for the IDs given to MongoDB, so we used strings in the Rama definition as well. MongoDB’s documents are just maps, so they’re stored that way in the Rama equivalent.
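Reading a document back out of this PState is a one-line point lookup (again a sketch, assuming a client PState handle bound to primary-pstate):

;; fetch one document by its ID, analogous to a MongoDB lookup on _id
(foreign-select-one (keypath "some-id") primary-pstate)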

Writing these modules using Rama’s Java API is pretty much the same amount of code. There’s no difference in performance between Rama’s Clojure and Java APIs as they both end up as the same bytecode.

Max write throughput benchmark

For the max write throughput benchmark, we wrote to each respective system as fast as possible from a single client colocated on the same node. Each request contained a batch of 100 writes, and the client used a semaphore and the system’s async API to only allow 1000 writes to be in-flight at a time. As requests got acknowledged, more requests were sent out.
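Here’s a sketch of that client loop. The names are ours: append-async! stands in for whichever asynchronous append call each system’s client exposes, and we assume it returns a CompletableFuture.

(import 'java.util.concurrent.Semaphore)

(defn run-write-benchmark [append-async! gen-batch]
  (let [in-flight (Semaphore. 1000)] ; at most 1000 writes in flight
    (while true
      (.acquire in-flight 100) ; one request carries a batch of 100 writes
      (-> (append-async! (gen-batch 100))
          (.thenRun (fn [] (.release in-flight 100)))))))

As acknowledgements come back, permits are released and the loop immediately issues more requests, keeping the system saturated.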

As described above, we built one Rama module that mimics how MongoDB works and another that mimics how Cassandra works. We then ran head-to-head benchmarks against each database, writing identical data to each system.

For the MongoDB tests, we wrote documents solely containing an “_id” key set to a UUID. Here’s MongoDB vs. Rama:

Rama’s throughput stabilized after 50 minutes, while MongoDB’s continued to decrease all the way to the end of the three-hour test. By the end, Rama’s throughput was 9x higher.

For the Cassandra tests, each write contained a separate UUID for the fields “pk”, “ck”, and “value”. We benchmarked Cassandra both with the default “periodic” commit mode, which does not guarantee durability on write acknowledgement, and with the “batch” commit mode, which does guarantee durability. As mentioned earlier, we couldn’t get Cassandra’s “group” commit mode to match the performance of “batch” mode, so we focused our benchmarks on the other two modes. Here’s a chart with benchmarks of each of these modes along with Rama:

Since Rama guarantees durable writes, the equivalent comparison is against Cassandra’s batch commit mode. As you can see, Rama’s throughput is 2.5x higher. Rama’s throughput is only a little below Cassandra’s when Cassandra is run without the durability guarantee.

Mixed read/write throughput benchmark

For the mixed read/write benchmark, we first wrote a fixed amount of data into each system. We wanted to see the performance after each system had a significant amount of data in it, as we didn’t want read performance skewed by the dataset being small enough to fit entirely in memory.

For the MongoDB tests, we wrote documents solely containing an “_id” field with a stringified number that incremented by two for each write (“0”, “2”, “4”, “6”, etc.). We wrote 250M of those documents (max ID was “500000000”). Then, for the mixed read/write test, we did 50% reads and 50% writes, with 1000 read/write pairs in flight at a time. Each write was a single document (as opposed to the batch write test above, which did 100 at a time), and each read was chosen randomly from the keyspace from “0” to the max ID. Since only half the numbers were written, each read had a 50% chance of being a hit and a 50% chance of being a miss.
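A quick sketch of the ID generation shows why each read has a 50% hit rate: writes cover only even IDs, while reads draw uniformly from the whole range (names here are ours):

(def max-id 500000000)

;; written IDs: "0", "2", "4", ... (even numbers only)
(defn write-ids [] (map str (range 0 (inc max-id) 2)))

;; read IDs: uniform over [0, max-id]; odd draws (half of them) are misses
(defn random-read-id [] (str (rand-int (inc max-id))))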

Here’s the result of the benchmark for MongoDB vs. Rama:

We also ran another test of MongoDB with half the initial data:

MongoDB’s performance is unaffected by the change in data volume, and Rama outperforms MongoDB in this benchmark by 2.5x.

For the Cassandra tests, we followed a similar strategy. For every write, we incremented the ID by two and wrote that number stringified for the “pk”, “ck”, and “value” fields (e.g. "INSERT INTO test.test (pk, ck, value) VALUES ('2', '2', '2');"). Reads were similarly chosen randomly from the keyspace from “0” to the max ID, with each read fetching the value for a “pk” and “ck” pair. Just like the MongoDB tests, each read had a 50% chance of being a hit and a 50% chance of being a miss.

After writing 250M rows to each system, here’s the result of the benchmark for Cassandra vs. Rama:

Rama performs more than 2.5x better in this benchmark whether or not Cassandra is guaranteeing durability of writes. Since Cassandra’s write performance in this non-durable mode was a little higher than Rama’s in our batch write throughput test, this test indicates its read performance is substantially worse.

Cassandra’s non-durable commit mode being slightly worse than its durable commit mode in this benchmark, along with Cassandra’s reputation as a high-performance database, made us wonder if we had misconfigured something. As described earlier, we tried increasing the memory allocated to the Cassandra process to match Rama’s (4GB), but that actually made its performance much worse. We made sure Cassandra was configured to use the local SSD for everything (data dir, commit log, and saved caches dir). Nothing else in the cassandra.yaml or cassandra-env.sh files seemed misconfigured. There are a variety of configs relating to compaction and caching that could be relevant, but Rama has similar configs that we also didn’t tune for these tests, so we left those at the defaults for both systems. After double-checking all the configs, we reran this benchmark for Cassandra in both commit modes and got the same results.

One suspicious data point was the amount of disk space used by each system. Since we wrote a fixed amount of identical data to each system before this test, we could compare this directly. Cassandra used 11GB for its “data dir”, which doesn’t include the commit log; Rama used 4GB for the equivalent. If you add up the raw number of bytes in 250M rows with identical “pk”, “ck”, and “value” fields that are stringified numbers incrementing by two, you get 6.1GB. Both Cassandra and Rama compress data on disk, and since there are so many identical values, compression should be effective. We don’t know enough about Cassandra’s implementation to say why its disk usage is so high relative to the amount of data being put into it.

We ran the test again for Cassandra with half the data (125M rows), and these were the results:

Cassandra’s numbers are much better here, though they were degrading towards the end. Cassandra’s read performance seems to suffer as the dataset gets larger.

Conclusion

We were surprised by how well Rama performed relative to Cassandra and MongoDB given that it also materializes a durable log. When compared to modes of operation that guarantee durability, Rama performed at least 2.5x better in every benchmark.

Benchmarks should always be taken with a grain of salt. We only tested on one kind of hardware, with contrived data, with specific access patterns, and with default configs. It’s possible MongoDB and Cassandra perform much better on different kinds of data sets or on different hardware.

Rama’s performance is reflective of the amount of work we put into its design and implementation. One of the key techniques we use all over the place in Rama’s implementation is what we call a “trailing flush”. This technique allows all disk and network operations to be batched even though they’re invoked one at a time. This is important because disk syncs and network flushes are expensive. For example, when an append is done to a depot (durable log), we don’t apply it immediately. Instead, the append gets put into an in-memory buffer, and an event is enqueued that will flush that buffer if no such event is already enqueued. When that event comes to the front of the processing queue, it flushes whatever has accumulated in the buffer. If the rate of appends is low, it may do a disk operation for a single append. As the rate of appends gets higher, the number of appends performed together increases. This technique greatly increases throughput while also minimizing latency. We use it for sending appends from a client, for flushing network messages in Netty (called “flush consolidation”), for writing to indexes, for sending replication messages to followers, and more.
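Here’s a minimal, single-threaded sketch of the trailing flush idea (the names are ours, not Rama’s internals): appends accumulate in a buffer, and at most one flush event sits on the processing queue at a time.

(import '(java.util.concurrent ExecutorService Executors))

(def ^ExecutorService processing-queue (Executors/newSingleThreadExecutor))

;; buffered appends plus a flag recording whether a flush event is enqueued
(def buffer (atom {:appends [] :flush-enqueued? false}))

(defn flush-buffer! [sync-to-disk!]
  ;; atomically take everything accumulated so far and clear the flag
  (let [[old _] (swap-vals! buffer assoc :appends [] :flush-enqueued? false)]
    (when (seq (:appends old))
      ;; one disk sync covers however many appends piled up
      (sync-to-disk! (:appends old)))))

(defn append! [record sync-to-disk!]
  (let [[old _] (swap-vals! buffer
                            (fn [b]
                              (-> b
                                  (update :appends conj record)
                                  (assoc :flush-enqueued? true))))]
    ;; enqueue a flush event only if one isn't already pending
    (when-not (:flush-enqueued? old)
      (.execute processing-queue (fn [] (flush-buffer! sync-to-disk!))))))

At low append rates each flush covers a single append, so latency stays low; at high rates the buffer grows between flushes, so each disk sync amortizes over many appends and throughput rises.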

The only performance numbers we shared previously were for our Twitter-scale Mastodon instance, so we felt it was important to publish some more numbers against tools many are already familiar with. If there are any flaws in how we benchmarked MongoDB or Cassandra, please share with us and we’ll be happy to repeat the benchmarks.

Since Rama encompasses so much more than data indexing, in the future we will be doing more benchmarks against different kinds of tooling, like queues and processing systems. Additionally, since Rama is an integrated system we expect its most impressive performance numbers to be when benchmarked against combinations of tooling (e.g. Kafka + Storm + Cassandra + ElasticSearch). Rama eliminates the overhead inherent when using combinations of tooling like that.

Finally, since Rama is currently in private beta, you have to join the beta to get access to a full release and reproduce these benchmarks. As mentioned at the start of this post, the code we used for the benchmarks is on our GitHub. Benchmarks are of course better when they can be independently reproduced. Eventually Rama will be generally available, but in the meantime we felt it was important to publish these numbers even with this limitation.
