The TPC-C benchmark is an industry-standard benchmark for measuring OLTP performance. It simulates the order-entry workload of a wholesale supplier, with a mix of five types of read and write transactions on a complex schema, including cross-partition transactions.
We chose CockroachDB as our comparison point for Rama because it publishes detailed, reproducible TPC-C results and is one of the most well-known distributed databases.
Here’s a summary of throughput for each:
| | CockroachDB | Rama |
|---|---|---|
| Nodes | 81 x c5d.9xlarge | 64 x i8g.4xlarge |
| Cluster cost/hr | $139.97 | $87.81 |
| vCPUs (total) | 2,916 | 1,024 |
| Warehouses | 140,000 | 140,000 |
| Replication factor | 3 | 3 |
| tpmC | 1,684,437 | 1,676,800 |
| Efficiency | 95.5% | 95.0% |
| Cost per tpmC | $0.0000831/hr | $0.0000524/hr |
Like CockroachDB, Rama is fully ACID-compliant. Both benchmarks were done with equal amounts of replication and equivalent isolation levels. The code and instructions to reproduce the Rama results are located here.
Here are the latency numbers for each. All values are in milliseconds. “New Order”, “Payment”, and “Delivery” are read-write transactions, while “Order Status” and “Stock Level” are read-only.
| Transaction | | CockroachDB | Rama |
|---|---|---|---|
| New Order (initiate) | p50 | N/A | 22.4 |
| | p90 | N/A | 129.5 |
| | p95 | N/A | 165.6 |
| | p99 | N/A | 236.9 |
| | max | N/A | 708.4 |
| New Order (complete) | p50 | 402.7 | 1,107.9 |
| | p90 | 1,409.3 | 1,574.7 |
| | p95 | 2,684.4 | 1,717.7 |
| | p99 | 9,126.8 | 2,226.8 |
| | max | 45,097.2 | 3,082.0 |
| Payment (initiate) | p50 | N/A | 22.3 |
| | p90 | N/A | 129.5 |
| | p95 | N/A | 165.6 |
| | p99 | N/A | 237.0 |
| | max | N/A | 685.6 |
| Payment (complete) | p50 | 251.7 | 1,115.3 |
| | p90 | 1,006.6 | 1,588.5 |
| | p95 | 2,415.9 | 1,730.3 |
| | p99 | 15,032.4 | 2,242.7 |
| | max | 103,079.2 | 3,086.3 |
| Delivery (initiate) | p50 | N/A | 22.1 |
| | p90 | N/A | 129.0 |
| | p95 | N/A | 165.2 |
| | p99 | N/A | 236.6 |
| | max | N/A | 607.1 |
| Delivery (complete) | p50 | 302.0 | 1,100* |
| | p90 | 1,140.9 | 1,600* |
| | p95 | 2,415.9 | 1,700* |
| | p99 | 9,126.8 | 2,200* |
| | max | 55,834.6 | 3,100* |
| Order Status | p50 | 6.8 | 10.0 |
| | p90 | 62.9 | 73.0 |
| | p95 | 125.8 | 112.0 |
| | p99 | 4,160.7 | 188.3 |
| | max | 33,286.0 | 542.9 |
| Stock Level | p50 | 39.8 | 70.0 |
| | p90 | 469.8 | 200.2 |
| | p95 | 906.0 | 243.2 |
| | p99 | 5,905.6 | 332.1 |
| | max | 38,654.7 | 772.2 |
*Estimated. Delivery complete latencies were not directly measured but are comparable to New Order and Payment, as all write transactions are processed together.
Rama’s write transactions report two latencies: “initiate” and “complete”. The initiate latency measures how long it takes for the event to be durably recorded and replicated. The complete latency measures the time until the transaction finishes, including initiation. Because durably storing the event at initiate time guarantees it will be processed transactionally (explained below), the front-end can acknowledge the user’s action at that point rather than waiting for the full transaction to complete. Not every use case can take advantage of this, since sometimes you need the result of the completed transaction before responding, but when you can, it’s a significant advantage.
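The ack-at-initiate pattern can be sketched in a few lines. This is an illustrative Python sketch, not Rama’s actual API; `append_durably` and `process_transaction` are hypothetical stand-ins for durable log replication and full transactional processing:

```python
# Sketch of acknowledging at "initiate" time and completing asynchronously.
# All names here are illustrative stand-ins, not Rama's API.
import concurrent.futures
import time

log = []  # stand-in for a replicated, durable event log

def append_durably(event: dict) -> None:
    # In a real system this would replicate and fsync before returning.
    log.append(event)

def process_transaction(event: dict) -> str:
    time.sleep(0.01)  # stand-in for the full transactional processing
    return f"processed {event['id']}"

def handle_request(event: dict, executor) -> tuple:
    append_durably(event)                              # "initiate": durable + replicated
    fut = executor.submit(process_transaction, event)  # "complete" happens later
    return "ack", fut                                  # respond to the user now

with concurrent.futures.ThreadPoolExecutor() as ex:
    status, fut = handle_request({"id": 1}, ex)
    print(status)        # the user already has their acknowledgment
    print(fut.result())  # completion arrives afterwards
```

The key property is that the acknowledgment only depends on durability of the event, not on the transaction’s result.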
CockroachDB’s median latencies for writes are lower than Rama’s median complete latencies, but the averages are about the same because of CockroachDB’s much wider latency distribution (TPC-C throughput is determined by response time of transactions, and since throughputs are about the same the average response times must be about the same). CockroachDB’s p99 latencies spike to 9-15 seconds and max latencies reach 45-103 seconds, while Rama’s p99 complete latencies stay under 2.2 seconds with max latencies under 3.1 seconds. CockroachDB’s write tail latencies are 100 to 400x higher than its medians, while Rama’s are only about 3x higher. This consistency matters in production – huge tail latencies don’t just affect the unlucky requests, they cause timeouts and retries that degrade the entire system.
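The tail-to-median ratios quoted above can be checked directly against the table (all values in milliseconds):

```python
# Max-to-median latency ratios, computed from the table above (ms).
# (p50, max) pairs for the "complete" write transactions.
cockroach = {
    "New Order": (402.7, 45_097.2),
    "Payment": (251.7, 103_079.2),
    "Delivery": (302.0, 55_834.6),
}
rama = {
    "New Order": (1_107.9, 3_082.0),
    "Payment": (1_115.3, 3_086.3),
    "Delivery": (1_100.0, 3_100.0),  # estimated values from the table
}

def tail_ratio(p50: float, mx: float) -> float:
    return mx / p50

for name, (p50, mx) in cockroach.items():
    print(f"CockroachDB {name}: {tail_ratio(p50, mx):.0f}x")
for name, (p50, mx) in rama.items():
    print(f"Rama {name}: {tail_ratio(p50, mx):.1f}x")
```

CockroachDB’s ratios come out between roughly 112x and 410x, while Rama’s are all under 2.9x.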
## Lines of code
Beyond performance, there’s a significant difference in implementation complexity. The table below compares lines of code for each transaction, including the transaction logic itself, execution logic, and the data generation code for each:
| Transaction | CockroachDB | Rama |
|---|---|---|
| New Order | 540 | 270 |
| Payment | 410 | 130 |
| Delivery | 170 | 60 |
| Order Status | 140 | 60 |
| Stock Level | 70 | 30 |
| Total | 1,330 | 550 |
Rama’s implementation is less than half the code. This isn’t because of language differences but because Rama’s programming model is fundamentally more concise.
## TPC-C benchmark summary
The TPC-C benchmark revolves around the concept of a “terminal” that simulates a person interacting with the system. Each terminal takes one of five different actions (new order, payment, delivery, order status query, stock levels query), and once that action completes waits for a variable amount of “keying and think time” before taking the next action.
The number of terminals allowed is limited by the number of “warehouses”, with each warehouse supporting exactly 10 terminals. Each warehouse also adds a fixed amount of data to the system, so scaling up the number of warehouses scales both the terminal count and the dataset size proportionally. Because each terminal spends time waiting between transactions, there is a maximum rate at which each terminal can submit transactions. This means overall throughput is capped by the number of warehouses, and you can’t increase throughput numbers without also scaling the amount of data managed by the system.
In TPC-C, throughput depends not only on how many transactions the system can handle but also on how quickly it responds. Since each terminal waits for a response before starting its keying and think time, slower responses reduce the rate at which that terminal can submit transactions. The efficiency metric captures this by comparing the achieved throughput to the theoretical maximum if all response times were zero.
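As a rough sketch, both the throughput ceiling and the efficiency metric fall out of the warehouse count. The constant below is the spec’s commonly cited theoretical maximum of 12.86 new-order transactions per minute per warehouse, derived from the mandated keying and think times; the function names and example figures are illustrative, not this benchmark’s exact numbers:

```python
# Rough model of TPC-C's throughput ceiling and efficiency metric.
# 12.86 tpmC per warehouse is the commonly cited theoretical maximum,
# which follows from the spec's keying and think times.
MAX_TPMC_PER_WAREHOUSE = 12.86

def max_tpmc(warehouses: int) -> float:
    # Theoretical ceiling if every response were instantaneous.
    return warehouses * MAX_TPMC_PER_WAREHOUSE

def efficiency(achieved_tpmc: float, warehouses: int) -> float:
    # Achieved throughput as a fraction of the theoretical maximum.
    return achieved_tpmc / max_tpmc(warehouses)

# Illustrative numbers (not this benchmark's figures):
print(f"{efficiency(1_000_000, 100_000):.1%}")  # 1,000,000 / 1,286,000
```

Because the ceiling scales with warehouses, the only ways to raise reported tpmC are adding warehouses (and therefore data) or responding faster.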
## Rama’s efficiency
Rama matches CockroachDB’s performance with significantly fewer resources, so it’s clearly doing something fundamentally different under the hood.
In CockroachDB and distributed databases generally, each transaction executes independently. Every individual transaction must coordinate across nodes by acquiring locks, reaching consensus, resolving conflicts, and committing. When you have thousands of concurrent transactions, you’re paying that coordination overhead thousands of times.
The Rama-based TPC-C implementation takes a different approach. Instead of processing transactions individually, work is grouped into “microbatches”. Each microbatch processes many operations together, amortizing the coordination overhead across all of them. The overhead of a single microbatch is much higher than the overhead of a single CockroachDB transaction, but the aggregate overhead of thousands of individual transactions far exceeds the overhead of one microbatch that handles them all.
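A toy cost model makes the amortization argument concrete. The constants here are made up for illustration and are not measurements of either system:

```python
# Toy cost model (made-up constants): coordination overhead dominates
# per-operation work in a distributed commit, so batching amortizes it.
COORD_OVERHEAD_MS = 5.0  # consensus round trips, locks, commit protocol
WORK_PER_OP_MS = 0.05    # the actual read/write work for one operation

def individual_transactions(n_ops: int) -> float:
    # Every transaction pays the coordination overhead separately.
    return n_ops * (COORD_OVERHEAD_MS + WORK_PER_OP_MS)

def one_microbatch(n_ops: int) -> float:
    # A microbatch costs much more than one transaction to coordinate,
    # but that cost is paid once and shared by every operation in it.
    batch_overhead = 20 * COORD_OVERHEAD_MS
    return batch_overhead + n_ops * WORK_PER_OP_MS

n = 10_000
print(f"{individual_transactions(n):,.0f} ms")  # 50,500 ms
print(f"{one_microbatch(n):,.0f} ms")           # 600 ms
```

Even with the microbatch's coordination cost set 20x higher than a single transaction's, batching 10,000 operations wins by nearly two orders of magnitude in this model.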
Critically, microbatch processing provides exactly-once semantics for updates to Rama’s indexed datastores (called PStates). This means Rama achieves the same ACID durability and consistency guarantees that CockroachDB provides, but through a fundamentally more efficient mechanism. Because of these exactly-once semantics, every microbatch is effectively a single cross-partition transaction.
As a developer using Rama, you program the microbatch code yourself as arbitrary business logic in Java or another JVM language. This code is colocated with your data and can switch between partitions at will, giving you full control over how operations are composed within a single atomic microbatch. The latencies for all the different write transactions for TPC-C are nearly identical because they’re being processed by the same microbatches in this implementation.
Note that Rama also supports stream processing for use cases that need lower latency. Stream processing can complete operations in just a few milliseconds, but it has configurable at-least-once or at-most-once semantics rather than the stronger exactly-once guarantee of microbatch processing. Microbatch processing is ideal for workloads that need very high throughput or cross-partition transactions, exactly the profile of TPC-C.
## Rama’s performance on a heavy cross-partition transaction workload
The TPC-C specification exercises cross-partition transactions only lightly: just 10% of new order transactions are cross-partition, and within those, only about 10% of the work actually crosses partition boundaries. So TPC-C barely touches the scenario that is most painful for distributed databases.
For distributed databases like CockroachDB, cross-partition transactions are expensive because they require coordination across nodes. Each cross-partition transaction incurs additional network round-trips for consensus and a commit protocol to ensure atomicity. These costs are paid for every cross-partition transaction and scale with the number of partitions involved.
For Rama’s microbatch processing, cross-partition work doesn’t change the coordination overhead. A cross-partition operation just adds network transfer to move computation to the relevant partition.
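The contrast can be sketched as a message-count model: classic two-phase commit pays per-participant round trips for every transaction, while a microbatch pays one coordination round for the whole batch plus a transfer per cross-partition operation. The constants and counts below are illustrative, not measurements of either system:

```python
# Toy message-count model (illustrative, not measured).
def two_phase_commit_messages(n_txns: int, partitions_per_txn: int) -> int:
    # Prepare + vote + commit + ack per participant, paid per transaction.
    return n_txns * 4 * partitions_per_txn

def microbatch_messages(n_ops: int, cluster_partitions: int) -> int:
    # One coordination round for the whole batch, plus one transfer per
    # operation that moves computation to another partition.
    batch_coordination = 4 * cluster_partitions
    return batch_coordination + n_ops

print(two_phase_commit_messages(10_000, 3))  # 120,000 messages
print(microbatch_messages(10_000, 64))       # 10,256 messages
```

In the per-transaction model the message count grows with both the transaction count and the partitions each touches; in the batch model the coordination term is fixed per batch.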
To demonstrate this advantage, we ran a modified version of the TPC-C benchmark where 99% of the new order transactions are cross-partition. Here are the results in comparison to the standard TPC-C workload:
| | Rama (Standard) | Rama (99% cross-partition) |
|---|---|---|
| Nodes | 64 x i8g.4xlarge | 64 x i8g.4xlarge |
| Warehouses | 140,000 | 140,000 |
| Replication factor | 3 | 3 |
| tpmC | 1,676,800 | 1,507,200 |
This is only a 10% decrease in throughput. We don’t believe CockroachDB or any distributed database built on the individual-transaction model can come close to matching Rama’s efficiency on this workload. The cost of cross-partition transactions in that model is inherent to the architecture, where every cross-partition transaction must pay for distributed coordination individually. With Rama’s microbatch processing, cross-partition coordination is a natural part of how microbatches work rather than a special expensive case.
## Efficiency of event sourcing
Rama is an event sourced system. A common reaction to this is that event sourcing must be slower than a traditional database, since you’re writing to a log and then writing to storage, rather than just writing to storage directly.
This benchmark proves the opposite. The efficiency gains from integrating computation and storage into a single system more than offset the cost of the additional log write.
The other common objection to event sourcing is complexity. In a traditional architecture, adopting event sourcing means bolting a message queue like Kafka onto your existing database, adding a stream processor to consume from it, and then managing the interactions between all these separate systems. That is genuinely complex. But Rama is a single integrated system rather than many separate ones, so it’s simple to operate. And unlike traditional event sourcing, Rama is not eventually consistent. It’s interactive and fully ACID compliant, as this benchmark demonstrates.
So Rama gives the benefits of event sourcing (complete auditability, the ability to derive new views from historical data, and the flexibility to reprocess events when your business logic changes) without any of the drawbacks.
## Conclusion
Rama is a very different way to build backend applications. The programming model, the way you think about data, and the way you think about computation are all unfamiliar at first. But the benefits are substantial. The efficiency gains demonstrated in this benchmark are just one dimension. Rama also eliminates the infrastructure sprawl that plagues traditional architectures, replacing the tangle of databases, queues, caches, and stream processors with a unified platform.
There is a learning curve, but there are good resources to help. The tutorial in the Rama docs walks you through the core concepts step by step, and this blog post series goes deep into building real applications. We’ve also found that AI assistants are increasingly effective at helping with Rama development. Given the docs and a few examples, they can both write Rama code and explain what’s happening, which can significantly accelerate the learning process.
If you’re interested in exploring Rama further, come join us on Discord. We’re happy to answer questions and help you get started.
