Teaching LLMs to one-shot complex backends at scale, report #1

The LLM is not all that matters in AI coding. What the LLM is targeting matters a great deal. A simpler target that requires less reasoning will produce better results.

Attempts to get LLMs to produce complex backends have been lackluster. A recent paper, Constraint Decay: The Fragility of LLM Agents in Backend Code Generation, shows that even on a simple CRUD app, end-to-end success on the full test suite tops out at 33% once realistic structural constraints are imposed.

Conventional backends are made of many separate systems glued together, each with its own model and failure modes. Most of the failures observed in these benchmarks show up at the seams between these systems. The LLM is not asked to reason about one coherent system – it’s asked to coordinate across many.

Along those lines, we believe Rama is ideally positioned to take LLM coding to the next level for backends. Rama collapses the typical backend stack (databases, queues, stream processors, application logic) into one integrated system. The seams that current LLMs trip over largely don’t exist in a Rama application. A horizontally scalable, fault-tolerant backend is expressed as one coherent program rather than as glue across half a dozen systems.

In the past few months we’ve been working on a project to teach LLMs to one-shot complex backends at scale with Rama as the substrate. Our results so far are very promising, as I’ll review later in this post, but we have a ways to go. The major milestone we’re working towards is one-shotting the entire Matrix spec, which also has a thorough set of tests available that can be used to verify an implementation. What we’re looking to produce is:

A generated implementation of Matrix that passes all the reference tests
Transcript showing every step of how the LLM one-shotted the project
Benchmarks automatically written and executed by the LLM that demonstrate high performance and horizontal scalability

Matrix is orders of magnitude more difficult than the backends current LLMs can handle, particularly with these scalability and fault-tolerance requirements, so one-shotting it will be a huge milestone. However, the overarching goal is for this to work on any backend problem. We don’t expect what we’re building to one-shot every possible backend. Humans remain vastly better than LLMs at broad systems design where many tradeoffs must be considered. What we think is achievable, and what this project is targeting, is a workflow where humans assist with high-level design decisions and the agent handles lower-level decisions and implementation, including achieving horizontal scalability and fault-tolerance. By “fault tolerant” we mean the system continues operating correctly through infrastructure failures (e.g. node deaths) without data loss, data duplication, or downtime, and recovers automatically when failed components return.

Whether our goal is possible remains to be seen, but I’ll be documenting our progress as we go via these progress reports.

Our workflow

We work through the rama-ai-learn project, which we just open-sourced. It’s a benchmark and harness for measuring how well LLMs can produce production Rama code, along with the skill content the agent uses to do the work.

Each task we throw at an agent is a “challenge.” A challenge directory (example) contains a README.md stating operations, latency targets, and other constraints. It also has an interface the agent must implement. The directory also contains private artifacts that are encrypted before runs so the agent can’t see them: tests covering functional correctness, fault-tolerance, and performance, and a reference implementation. After an agent finishes its implementation and passes its own tests, the challenge runner runs the formerly encrypted tests to determine whether the agent succeeded or failed.

Agents are run inside a Docker container with full permissions. We capture every agent invocation’s full transcript, including thinking, tool uses, tool results, and the final response. Thinking is particularly valuable. It’s how we discover failure modes that don’t show up in the produced code, like an agent identifying a fault-tolerance gap, going back and forth on possible solutions, and then saying “this is getting complicated” and failing to address it at all.

Rama has Java and Clojure APIs. We’re focused on Clojure for now but will produce an equally capable Java version of the skills later. The REPL is the main reason: with a long-running REPL session, the agent evaluates code and inspects results in milliseconds instead of repeatedly paying for JVM startup and dependency loading. We expect this advantage to matter more as challenges get harder and converging on a correct design takes the agent many iterations.

Working on improving LLM performance involves making a new challenge and then iterating on the skill files until the agent passes it consistently. Then we move on to a new challenge that stresses a different type of application, a different capability of Rama, or an application of greater scope.

Current status

Three medium-complexity challenges now one-shot correctly on almost every run, all horizontally scalable and fault-tolerant:

Bank transfer: tracks funds for each user and supports deposits, transfers between users, balance reads, and inbound/outbound transfer history reads per user. Transfers must be exactly-once and fault-tolerant: no double-spends under retry, no transfer that lands on one side but not the other under failure, and no negative balances from insufficient funds.
Time series: records render latency measurements per URL and answers range-aggregate queries (cardinality, total, min, max) over any minute-bucket range from a single minute to many years. Queries must be fast across the full range of query lengths, whether five minutes or five years.
Auction: hosts auctions where sellers list items with expiration times, bidders place bids, and the highest bid at expiration wins. Supports reading a seller’s listings, a listing’s bids and current top bid, and a user’s notifications. Auctions must end automatically when their expiration time passes, and seller / winner / losing-bidder notifications must each be delivered exactly once and in chronological order.

The private tests verify performance and fault-tolerance characteristics using the with-event-hook macro that we added in the latest Rama release. With it we can assert that computation is balanced, check the number and types of underlying RocksDB operations, inspect the topology types chosen by the agent, force failures and retries, and more.

The bank transfer challenge is the easiest of the three. The main test is whether the agent recognizes that update latency in the hundreds of milliseconds is acceptable, and therefore that a microbatch topology is the right tool. Microbatch topologies have fault-tolerant exactly-once semantics for all computation. The challenge also verifies the agent chooses the optimal PState (Rama’s equivalent to databases) structures, especially that transfer logs are subindexed since they’re unbounded.

The hard part of the time series challenge is pre-aggregating the latency data at multiple granularities and then using a server side distributed query (called a “query topology”) to efficiently compute the range query while reading as few buckets as possible. The agent does a great job of reasoning through how many RocksDB operations will be done depending on the number of buckets read and then choosing the appropriate number of granularities. It also recognizes a query topology is appropriate since that’s more efficient than many roundtrips between the client and worker nodes.
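The bucket-counting reasoning described above can be illustrated with a minimal sketch. This is plain Python rather than Rama code, and the granularities (minute, hour, day, 30 days) are assumptions chosen for illustration; the post does not say which granularities the agent or the reference implementation uses. The function name `cover_range` is likewise hypothetical.

```python
# Hypothetical granularities in minutes, coarsest first. The actual
# granularities chosen in the challenge are not stated in the post.
GRANULARITIES = [30 * 24 * 60, 24 * 60, 60, 1]  # 30 days, 1 day, 1 hour, 1 minute


def cover_range(start_minute, end_minute, granularities=GRANULARITIES):
    """Cover the half-open range [start_minute, end_minute) with aligned buckets.

    Returns a list of (granularity, bucket_index) pairs. At each position the
    coarsest bucket that is aligned there and fits in the remaining range is
    used, so long ranges are answered mostly from coarse pre-aggregated buckets.
    """
    buckets = []
    pos = start_minute
    while pos < end_minute:
        for g in granularities:
            if pos % g == 0 and pos + g <= end_minute:
                buckets.append((g, pos // g))
                pos += g
                break
        else:
            # Only reachable if the finest granularity is coarser than one minute.
            raise ValueError("granularities must include a 1-minute bucket")
    return buckets


five_years = 5 * 365 * 24 * 60
print(len(cover_range(0, five_years)))                 # buckets read with pre-aggregation
print(len(cover_range(0, five_years, [1])))            # buckets read with minute buckets only
```

Under these assumed granularities, a five-year query aligned at minute zero touches 85 buckets instead of 2,628,000. Aggregates such as total, min, and max combine across buckets without rereading the raw measurements, which is what makes the coarse buckets usable.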

The auction challenge is the hardest: it has multiple features with differing performance characteristics, polymorphic data (for notification types), and time-based behavior. Getting notifications of auction results to have exactly-once delivery with fault-tolerance is easy to get wrong. The agent sometimes lands on something close to the reference design: a stream topology for listings and bids and a microbatch topology for expirations and notifications. What surprised me is that it also sometimes produces a design I had never considered. That design uses just a stream topology and stores notifications as a map rather than a list, keyed deterministically by the listing’s timestamp, the listing ID, and the notification type. Because the key is deterministic, a retried expiration rewrites the same keys with the same values, achieving exactly-once by making delivery of the notification an idempotent operation.
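The idempotent-delivery idea can be shown with a minimal sketch. This is plain Python with an in-memory dict standing in for a PState, not the agent's actual Rama code; the function names and the payload shape are hypothetical, and it assumes that sorting by the key yields the required chronological order.

```python
def deliver_to_map(notifications, user_id, listing_timestamp, listing_id,
                   notification_type, payload):
    """Store a notification under a key derived only from the event itself."""
    key = (listing_timestamp, listing_id, notification_type)
    # A retry computes the same key and overwrites the entry with the same value.
    notifications.setdefault(user_id, {})[key] = payload


def deliver_to_list(notifications, user_id, payload):
    """Naive alternative: appending is not idempotent, so a retry duplicates."""
    notifications.setdefault(user_id, []).append(payload)


def read_notifications(notifications, user_id):
    """Return payloads ordered by key (timestamp first)."""
    entries = notifications.get(user_id, {})
    return [payload for _, payload in sorted(entries.items(), key=lambda kv: kv[0])]


as_map, as_list = {}, {}
for _attempt in range(2):  # simulate an expiration that is processed, then retried
    deliver_to_map(as_map, "alice", 1700000000, "listing-7", "winner", "You won listing-7")
    deliver_to_list(as_list, "alice", "You won listing-7")

print(read_notifications(as_map, "alice"))  # one entry despite the retry
print(as_list["alice"])                     # two entries: the retry duplicated it
```

The design choice is that the write itself absorbs retries, so no separate deduplication step or exactly-once topology is needed for this path.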

The published benchmarks show LLMs struggling with far simpler backends. They measure single-instance applications with no scalability or fault-tolerance requirements. Our challenges add exactly-once semantics, fault tolerance, horizontal scalability, and performance constraints that the tests actively verify. With seams removed, the model can spend its reasoning budget on the application requirements rather than a host of random technical details tangential to the application.

Skill files

The skill files we’ve developed have gone through many iterations in order to pass these challenges consistently. At first, they were basically just the Rama documentation translated to Markdown files. The top-level skill had core information needed to program Rama (concepts, dataflow syntax, paths), while less-needed information was put in reference files.

The agent consistently made the same mistakes even when the correct guidance was loaded in context. An agent that reads a reference at the start of a session would fail to revisit it at the moment of a specific decision. Some examples:

Rather than refer to the var bound by (defmodule FooModule ...) directly as FooModule, it would try to construct it as (FooModule.)
Unbounded locations in a PState would not be subindexed
Object would be used for schemas instead of something precise
Partitioners would be missing, especially before writes, causing the module to write to the wrong locations

We researched best practices on making skills and then did a major restructuring of the skills to instruct the agent to use a phased approach:

Implicit spec. Derive every edge case and invariant the protocol implies but doesn’t state. Produces an IMPLICIT_SPEC.md document.
Plan. Design the depots, PState schemas, and topologies before writing any code. Produces PLAN.md.
Plan validation. Review the plan against a checklist of scenarios (e.g. race conditions, failures/retries). Produces PLAN_VALIDATION.md with a pass or fail verdict.
Implement. Write the module source, adhering strictly to the plan.
Implementation validation. Review the implementation against a checklist of common mistakes. Produces IMPLEMENTATION_VALIDATION.md.
Tests. Write thorough tests covering every protocol method and every edge case from the implicit spec.
Test validation. Review the test code against a checklist of common mistakes. Produces TEST_VALIDATION.md.
Run tests. Run the test suite.

Three things make this effective. Requiring an artifact at every phase prevents the agent from silently skipping a step. The validation phases are written as explicit checklists, and LLMs are reliable at walking a checklist item-by-item without forgetting entries. And the phase split lets the agent focus on smaller pieces of work at a time.

A few language patterns turned out to make a big difference. Negative constraints with a reason work much better than positive suggestions at preventing the agent from going down wrong paths. Including a reason is critical as otherwise the agent comes up with some random justification to ignore it (e.g. “Do NOT write any topology code in this phase — implementations without plans produce wrong PState schemas, incorrect partition alignment, and unnecessary disk I/O”). We specify explicit default choices and require justification to deviate (e.g. “default to microbatch; if stream, state which API requires it” forces the agent to justify the choice it’s making, while “choose stream or microbatch” lets the agent go with whatever it already decided).

When working on the auction challenge, we started to run into issues with compaction. When the agent would compact in the middle of a run, it would frequently skip critical instructions. So we made some changes to the skill to make it compaction-resistant. Instead of all the instructions living in the skill itself, we changed the skill to instruct the agent to copy pre-written templates for all five artifacts into the implementation directory. Each template contains the full checklist and instructions for filling it in, plus a signpost at the top saying what phase it belongs to and what comes next. The agent’s job is to “fill in” each template, not write from scratch. We also instruct it to put signposts in comments in the implementation files.

After compaction, whatever file the agent reads, whether an unfilled template, the source code, or a partially completed validation, the signposts tell it where it is in the workflow and what to do next. The key insight is that critical instructions living in context get lobotomized by compaction, but instructions embedded in files on disk are permanent. The templates and signposts ensure the agent always has a roadmap. It knows where it came from (which artifacts are already filled in) and where it needs to go next (what the current template is asking for).
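To make the "state lives on disk" idea concrete, here is a minimal sketch of how progress could be recovered from the artifacts alone. It is an illustration, not the project's tooling: the artifact file names come from the phase list above, but the marker string, the function name, and the convention that a template carries the marker until it is filled in are all assumptions.

```python
from pathlib import Path

# Artifact names are taken from the phase descriptions in this post.
ARTIFACTS = [
    "IMPLICIT_SPEC.md",
    "PLAN.md",
    "PLAN_VALIDATION.md",
    "IMPLEMENTATION_VALIDATION.md",
    "TEST_VALIDATION.md",
]

# Hypothetical signpost: assume each template carries this line until filled in.
UNFILLED_MARKER = "<!-- TEMPLATE: UNFILLED -->"


def next_unfilled_artifact(impl_dir):
    """Return the first artifact that is missing or still an unfilled template.

    Returns None when every artifact has been filled in. Nothing here depends
    on conversation context, so the answer is the same before and after
    compaction.
    """
    for name in ARTIFACTS:
        path = Path(impl_dir) / name
        if not path.exists() or UNFILLED_MARKER in path.read_text():
            return name
    return None
```

In the workflow described here the agent does this by reading the signposts itself rather than by running a script; the sketch only shows why files on disk are enough to locate the current phase.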

The phases, validations, and signposts form a redundant structure that constrains the agent’s search space and funnels it towards a correct solution. There are some critical instructions that we put at the top of the skill to help with that funneling.

One of the most important is emphasizing that the agent is building a production-worthy application. Without this explicit emphasis, the agent would frequently say things like “This code is just for testing locally, so failures are rare and I don’t need to consider them”. We actually lie in the challenge prompt by saying “You are building a production Rama module. It will be deployed under real conditions — node failures, processing retries, concurrent clients, high throughput.”

Other critical instructions at the top of the skill are “Never trade I/O efficiency for code simplicity” and “Never trade fault tolerance for code simplicity”. LLMs have a strong bias towards what they consider “simple” code and are willing to sacrifice important performance and fault-tolerance properties to get it. In reality, achieving efficiency and fault-tolerance in these challenges requires barely any more code, so these instructions cause the agent to search a little harder to find the right solution.

The rest of the skill is information about Rama that the agent uses to design and implement. One of the most impactful pieces of information was adding latency numbers for RocksDB operations (which Rama uses in the implementation of PStates). We provide approximate numbers for the cost of RocksDB reads, writes, iterator seeks, and iterator reads. During challenge runs, you can see the agent do the math to estimate the cost of various strategies to inform potential design decisions. In the time series challenge, for example, this is how the agent decides to use multiple granularities and how many granularities to use.

We only put general, high-level information in the skills. All examples in the skills are unrelated to the challenges we’ve made. Watching the agent reason from the high-level properties of Rama to implementing correct solutions for each challenge is very cool to see.

Orchestration

Until recently, we had the agent orchestrate itself through all the phases of the skill. We prompted it with the challenge instructions and let it walk the whole process in one session. That stopped working on the latest challenge we’re working on (described below). Once an agent committed to a design early in the session, it tunneled on that design for the rest of the run, even when its own later thinking surfaced problems with it. The validation phases in particular would rationalize away findings with justifications that didn’t make sense.

We made a custom orchestration script to fix this. It invokes the agent once per phase with a fresh context. Each phase reads only its own doc and the artifacts produced by earlier phases, then stops. Validation phases are instructed to be adversarial against the artifacts from the previous agent. The rationalization loops we saw before largely don’t have a place to form, because the agent doing validation is no longer the same agent that wrote the plan or implementation.

As a side effect, much of the compaction-resistance work we did earlier is less relevant. Phases are short enough that compaction rarely fires inside one. That said, the compaction-resistant structure is still worth having if only to lessen the amount of information loaded in context when reading a skill, since so many of the instructions are in the artifacts on disk.

However, there are major issues with the orchestration script that make me think we’ll ultimately abandon it. First off, it’s unrealistic that this orchestration script would be used by actual developers, since it’s contrary to normal usage of coding agents.

It also adds significant latency. Each phase invocation pays a fresh-context cost (re-reading the skill, re-reading prior artifacts, re-loading references) before it can do any real work. In order to lower latency, the script has accumulated a lot of logic about what to do when a validation phase fails. At first the orchestration was strict. A single test failure kicked the run all the way back to the implementation phase, which then had to flow forward through implementation validation, test writing, test validation, and test running again. Most of that re-work was unnecessary, because test failures usually surface small, localized bugs the agent can fix without running most of the phases again. To avoid that overhead we introduced the minor-fail / major-fail distinction in the validation phases (only major changes trigger another adversarial review) and added a “finish” phase at the end where the agent stays in a single context and iterates on the module and tests until the test suite passes. Both changes reduced wasted work, but they grew the state machine considerably.
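The growth in the state machine can be seen even in a heavily simplified sketch. The phase names below follow the phase list in this post plus the “finish” phase, but the transition rules are assumptions made for illustration: the post does not spell out exactly where a major fail returns to or how a minor fail is handled, and the real script has accumulated more logic than this.

```python
PHASES = [
    "implicit_spec",
    "plan",
    "plan_validation",
    "implement",
    "implementation_validation",
    "tests",
    "test_validation",
    "run_tests",
    "finish",
]

# Assumed mapping: each validation phase reviews the artifact of one producer phase.
PRODUCER = {
    "plan_validation": "plan",
    "implementation_validation": "implement",
    "test_validation": "tests",
}

VERDICTS = {"pass", "minor_fail", "major_fail"}


def next_phase(phase, verdict):
    """Return the next phase to invoke with a fresh context, or None when done.

    Assumed rules: a major fail in a validation phase sends the run back to
    the producer phase, which is then reviewed again. A minor fail is fixed in
    place and the run moves forward without another adversarial review. A test
    failure no longer restarts implementation; it falls through to "finish",
    where one context iterates until the suite passes.
    """
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    if phase == "finish":
        return None
    if verdict == "major_fail" and phase in PRODUCER:
        return PRODUCER[phase]
    return PHASES[PHASES.index(phase) + 1]
```

Every additional rule of this kind saves re-work, but each one is another branch that the script has to get right.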

Instead, we’re thinking of having a top-level agent handle orchestration and delegate to subagents for execution of each phase. We’ll have to make sure we can still capture equivalent transcripts, and we’ll have to see what kinds of mistakes the top-level agent makes during orchestration.

Finally, a couple of small things we include as part of orchestration of the implementation phase are worth mentioning. After the agent finishes writing the module, it must verify the namespace loads cleanly before finishing the phase. Additionally, it runs a linter and fixes all errors before moving on. The linting hooks we made for Rama statically catch things like arity mismatches, using invalid dataflow forms, or referencing undeclared PStates. Including both of these steps in the implementation phase saves a lot of time by catching problems that would otherwise surface later as test failures and force phases to retry.

Current challenge

We’re currently working on the “fanout” challenge, a social-media-style backend with profiles, posts, follows, and a per-user merged timeline of posts from accounts the user follows. It’s based on our Twitter-scale Mastodon implementation. The module the agent must implement runs alongside a provided social-graph module that stores the social graph in an optimized way so that fanout can be balanced even for a heavily unbalanced social graph. We wrote in detail how fanout at scale works in this blog post.

The spec requires that merged timelines be kept in memory (writing those to disk drops throughput by 15x), and that fanout be balanced and fair. A post by an account with millions of followers cannot delay fanout for a post by an account with ten. Because timelines are kept in memory, the agent needs to find an alternative way to make them fault-tolerant. The correct approach is to reconstruct lost timelines on read by querying for the recent posts of followees.

The agent has not yet produced an implementation that satisfies the non-functional constraints. It gets the functionality correct and is designing and implementing the reconstruction process, but it fails in two areas:

Minimizing data in the cache and using memory- and GC-efficient data structures. The correct solution uses a ring buffer of long values holding just the user ID and post ID for each entry in the timeline. So far the agent always uses a TreeMap for the timeline and also stores the post content in the cache, which massively and unnecessarily increases memory usage, since the content can just be fetched in the query topology that fetches a timeline page.
Fairness is not handled at all. It eagerly delivers posts to all followers immediately instead of spreading out delivery for large users over a longer period of time, violating the spec.
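The ring buffer mentioned in the first item can be sketched as follows. This is a Python stand-in for what would be a primitive long array on the JVM, not the reference implementation; the class name, method names, and fixed capacity are hypothetical. It stores only (user ID, post ID) pairs, on the assumption stated above that post content is fetched at read time.

```python
from array import array


class TimelineRing:
    """Fixed-capacity timeline holding (user_id, post_id) pairs as 64-bit integers."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        # One flat array of signed 64-bit values: two slots per timeline entry.
        self._slots = array("q", [0]) * (2 * capacity)
        self._next = 0   # index of the entry that the next append overwrites
        self._size = 0   # number of valid entries, at most capacity

    def append(self, user_id, post_id):
        """Add an entry, overwriting the oldest one once the buffer is full."""
        base = 2 * self._next
        self._slots[base] = user_id
        self._slots[base + 1] = post_id
        self._next = (self._next + 1) % self._capacity
        self._size = min(self._size + 1, self._capacity)

    def newest_first(self, limit):
        """Return up to `limit` (user_id, post_id) pairs, most recent first."""
        page = []
        for k in range(min(limit, self._size)):
            entry = (self._next - 1 - k) % self._capacity
            page.append((self._slots[2 * entry], self._slots[2 * entry + 1]))
        return page


ring = TimelineRing(capacity=3)
for user_id, post_id in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    ring.append(user_id, post_id)
print(ring.newest_first(10))  # the oldest entry (1, 10) has been overwritten
```

Compared with a sorted map of post objects, each entry costs a fixed 16 bytes in one contiguous allocation, there are no per-entry objects for the garbage collector to trace, and the memory bound is set by the capacity rather than by post size.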

We’re iterating on more guidance and validation steps to get the agent to make the correct decisions for these. It’s possible these are the kinds of details LLMs are unable to get right, since they’re high-level design decisions that require reasoning about runtime tradeoffs. So it’s fine if we have to make the challenge easier by telling it how to achieve these properties. But it would be better if we can get the agent to come to these conclusions on its own.

Model choice

We’re currently testing with Opus 4.6. We tried 4.7, but its responses include only summaries of the model’s reasoning rather than the raw thinking blocks. That’s a real problem for skill development, as seeing the model’s thought process is critical to determine how to get it to stop going down wrong paths.

We plan to test with other models, especially Codex, once we make more progress.

Open questions

We have many open questions to resolve:

How many challenges do we need to be confident we’re not inadvertently overfitting? There’s nothing specific to the challenges in the skills / orchestration, but their structure might be tuned to the particular patterns LLMs follow for these particular problems.
Matrix is too big a project to implement the whole thing via one iteration through our phase structure. How should orchestration be structured, and how should the agent split up the work? How does it iterate on a design as it learns more through implementation?
The Matrix test suite will be helpful not just for evaluating a Matrix implementation; it could also help the agent during development to iterate and check exact behavior. Will we be able to get Matrix to one-shot without access to the tests?
What’s the best way for developers to use these skills in practice? How should artifact generation be handled?
How can larger projects be efficiently parallelized?
What’s the impact of generating Rama applications that need to integrate with lots of existing infrastructure?

LLM usage during development

LLMs are helpful for building this project, though they can be actively detrimental if not used properly.

The most useful application has been transcript analysis. We had the LLM build a transcript analysis script (scripts/analyze-latest-transcript.py) with subcommands for searching thinking blocks, tool calls, validation artifacts, and so on. When we ask the LLM to dig into why an agent made a particular mistake, it uses that script freely, so we don’t need to constantly grant permission for it to run custom shell or Python commands to analyze transcripts.

We also use LLMs to build the harness and tooling for the project. LLMs built the orchestration script that runs challenges, and they’re helpful for assisting in making new challenges.

We also have an alignment-scoring pass that runs after each challenge. It compares the agent’s implementation and tests to the reference implementation and tests, producing a short summary of differences. It’s not a significant part of the workflow, but reading a short summary is a faster way to spot a regression than diffing files directly.

LLMs are useful for brainstorming skill updates after we identify a failure mode, but only as assistants. They’re not good at understanding why they made a mistake. They always think they have an answer, even when the problems are subtle. They cheat like crazy, trying to add challenge-specific info or examples into generic reference files. So LLM suggestions can help with critical analysis, but they don’t substitute for it.

Conclusion

I expect future progress reports to be much shorter than this one, since I’ll just talk about new progress. I hope these progress reports are useful to anyone using LLMs or developing skills of their own, and I hope to get useful ideas and feedback from anyone following along.