Tour of our 250k line Clojure codebase

At Red Planet Labs we’ve been quietly developing a new kind of developer tool for many years. Our tool reduces the cost of building large-scale end-to-end applications by multiple orders of magnitude, and Clojure is a big reason why we’ve been able to tackle such an ambitious project with a small team.

Our codebase consists of 250k lines of Clojure split evenly between source and test code. It’s one of the largest Clojure codebases in the world. In this post I’ll give a tour of how we organize our code so a project of this size can be understood amongst a team, the development and testing techniques we use that leverage the unique qualities of Clojure, and an overview of the key libraries we use.

Custom language within a language

One of the coolest parts of our codebase is the new general purpose language at its foundation. Though the semantics of the language are substantially different than Clojure, it’s defined entirely within Clojure using macros to express the differing behavior. It compiles directly to bytecode using the ASM library. The rest of our system is built using both this language and vanilla Clojure, interoperating seamlessly.

One of the striking capabilities our language has that vanilla Clojure does not is first-class continuations. The way in which our language expresses continuations makes it extremely good at async, parallel, and reactive programming. All of these are foundational to the large-scale distributed infrastructure we’re building.

That you can build an entirely new language with radically different semantics within Clojure demonstrates how powerful Clojure is. There’s a lot you get "for free" when building a language this way: lexing, parsing, datatypes, namespaces, immutable data structures, and the entire library ecosystem of Clojure and the JVM. Ultimately our new language is Clojure since it’s defined within Clojure, so it benefits from seamless interoperability with both Clojure and the JVM.

The vast majority of applications are not going to need to develop a full language like we have. But there are plenty of use cases where a focused DSL is appropriate, and we have examples of that too. The ability when using Clojure to customize how code itself is interpreted, via macros and meta-programming, is an incredibly powerful capability.

Type/schema checking

Central to any codebase is the data that is created, managed, and manipulated. We find it’s imperative to carefully and clearly document the data flying around the system. At the same time, type or schema annotations add overhead so it’s important to be thoughtful and not overdo it.

We use the Schema library for defining datatypes within our codebase. It’s easy to use and we like the flexibility to define schema constraints beyond just types: e.g. arbitrary predicates, enums, and unions. Our codebase contains about 600 type definitions, most of which are annotated using Schema.

Around Schema we have a helper called "defrecord+" which defines constructor functions which also perform validation (e.g. for type Foo it generates "->valid-Foo" and "map->valid-Foo"). These functions throw a descriptive exception if the schema check fails.
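To illustrate the idea, here is a minimal sketch (not our actual defrecord+ implementation) of pairing a Schema-annotated record with validating constructors, using a hypothetical Foo type:

(require '[schema.core :as s])

;; Hypothetical example type with Schema annotations on its fields
(s/defrecord Foo
  [id   :- s/Int
   name :- s/Str])

(defn ->valid-Foo
  "Like ->Foo, but throws a descriptive exception if the schema check fails."
  [id name]
  (s/validate Foo (->Foo id name)))

(defn map->valid-Foo
  "Like map->Foo, but validated."
  [m]
  (s/validate Foo (map->Foo m)))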

There’s no static type checking in Clojure, and static type checks wouldn’t be able to check all the kinds of constraints we define using Schema anyway (e.g. the value of a number being within a certain range). We’ve found we only need to insert schema checking on either:

  • Construction of types, for which our auto-generated "valid" constructor functions remove all the ceremony. Detecting an error when creating a record is much better than when using it later on, as during creation you have the context needed to debug the problem.
  • A few strategic spots throughout the codebase where lots of different types flow.

We only occasionally annotate the types of function args and return values. We find instead that being consistent about how we name things is good enough for understanding the code. We do have about 500 assertions throughout our codebase, though these are generally about higher-level properties rather than simple type checks.

The approach we’ve taken for schema definition and enforcement is lightweight, comprehensive, and doesn’t get in our way. The lack of static typing in Clojure scares a lot of programmers who have never used Clojure, and all we can say is that with a little bit of thought in how you organize your code it’s not an issue at all. And doing things dynamically means we can enforce stronger constraints than possible with static type systems.

Multi-module repository setup

Our codebase exists in a single git repo with four modules to split up the implementation:

  • "core", which contains the definition of our compiler and the corresponding abstractions for parallel programming
  • "distributed", which implements those parallel programming abstractions as a distributed cluster
  • "rpl-specter", an internal fork of Specter which adds a ton of functionality
  • "webui", which implements the front end of our product

We use Leiningen and deps.edn for our build. The ability to specify local targets as dependencies in deps.edn files is key to our multi-module setup, and the basic organization of our source tree looks like:

project.clj
deps.edn
rpl-specter/project.clj
rpl-specter/deps.edn
core/project.clj
core/deps.edn
distributed/project.clj
distributed/deps.edn
webui/project.clj
webui/deps.edn

Here’s an excerpt from our deps.edn file for "distributed":

{:deps {rpl/core {:local/root "../core"
                  :deps/manifest :deps}
        ...
        }
  ...
  }

This setup lets us develop within any one of the modules and automatically see any source changes in the other modules without having to make explicit Maven dependencies.

Loading the entire codebase for running tests or loading a REPL is pretty slow (largely from compilation of code using our custom language), so we use AOT compilation heavily to speed up development. Since we spend most of our time developing in “distributed”, we AOT compile “core” to speed things up.

Polymorphic data with Specter

Specter is a library we developed for supercharging our ability to work with data structures, especially nested and recursive data. Specter is based around the concept of “paths” into data structures, where a path can “navigate” to any number of values starting from the root of a data structure. The path can include traversals, views, and filters, and they’re deeply composable.

Our compiler compiles code into an abstract representation with a distinct record type for each kind of operation possible in our language. There are a variety of attributes every operation type must expose in a uniform way. For example, one of these attributes is “needed fields”, the fields in the closure of that operation that it requires to do its work. A typical way to express this polymorphic behavior would be to use an interface or protocol, like so:

(defprotocol NeededFields
  (needed-fields [this]))

The problem with this approach is it only covers querying. Some phases of our compiler must rewrite the fields throughout the abstract representation (e.g. uniquing vars to remove shadowing) and this protocol doesn’t support that. A (set-needed-fields [this fields] ) method could be added to this protocol, but that doesn’t cleanly fit data types which have a fixed number of input fields. It also doesn’t compose well for nested manipulation.

Instead, we use Specter’s "protocol paths" feature to organize the common attributes of our varying compiler types. Here’s an excerpt from our compiler:

(defprotocolpath NeededFields [])

(defrecord+ OperationInput
  [fields :- [(s/pred opvar?)]
   apply? :- Boolean
   ])

(defrecord+ Invoke
  [op    :- (s/cond-pre (s/pred opvar?) IFn RFn)
   input :- OperationInput])

(extend-protocolpath NeededFields Invoke
  (multi-path [:op opvar?] [:input :fields ALL]))

(defrecord+ VarAnnotation
  [var :- (s/pred opvar?)
   options :- {s/Keyword Object}])

(extend-protocolpath NeededFields VarAnnotation
  :var)

(defrecord+ Producer
  [producer :- (s/cond-pre (s/pred opvar?) PFn)])

(extend-protocolpath NeededFields Producer
  [:producer opvar?])

"Invoke", for instance, is the type that represents calling another function. The :op field could be a static function or a var reference to a function in the closure. The other path navigates to all the fields used as arguments to the function invocation.

This structure is extremely flexible and allows for modifications to be expressed just as easily as queries by integrating directly with Specter. For instance, we can append a "-foo" suffix to all the needed fields in a sequence of operations like so:

(setval [ALL NeededFields NAME END] "-foo" ops)

If we want the unique set of fields used in a sequence of ops, the code is:

(set (select [ALL NeededFields] ops))

Protocol paths are a way to make the data itself polymorphic and able to integrate with the supercharged abilities of Specter. They greatly reduce the number of manipulation helper functions that would be required otherwise and make the codebase far more comprehensible.

Organizing complex subsystems with Component

The daemons that make up the distributed system we’re building are composed of dozens of subsystems that build on top of and depend on one another. The subsystems need to be started in a particular order, and in tests they must be torn down in a particular order. Additionally, within tests we need the ability to inject mocks for some subsystems or disable some subsystems altogether.

We use the Component library to organize our subsystems in a way that manages lifecycle and gives us the flexibility to inject alternate dependencies or disable subsystems. Internally, we built a "defrcomponent" helper to unify field and dependency declarations. For example, from our codebase:

(defrcomponent AdminUiWebserver
  {:init      [port]
   :deps      [metastore
               service-handler
               cluster-retriever]
   :generated [^org.eclipse.jetty.server.Server jetty-instance]}

  component/Lifecycle
  ...
  )

This automatically retrieves fields "metastore", "service-handler", and "cluster-retriever" from the system map it’s started in and makes them available in the closure of the component’s implementation. It expects one field "port" in the constructor of the component, and on startup it generates another field, "jetty-instance", in its internal closure.

We also extended the component lifecycle paradigm with "start-async" and "stop-async" protocol methods. Some components do part of their initialization/teardown on other threads, and it was important for the rest of our system (especially deterministic simulation, described below) for those to be doable in a non-blocking way.
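A hypothetical sketch of what that extension could look like (names and signatures assumed, not our exact protocol):

(defprotocol LifecycleAsync
  (start-async [component done]
    "Begin starting the component; deliver the done promise once startup has fully completed.")
  (stop-async [component done]
    "Begin stopping the component; deliver the done promise once teardown has fully completed."))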

Our test infrastructure builds upon Component for doing dependency injection. For instance, from our test code:

(sc/with-simulated-cluster
  [{:ticker (rcomponent/noop-component)}
   {:keys [cluster-manager
           executor-service-factory
           metastore]
    :as   full-system}]
  ...
  )

That first map is a dependency injection map, and this code disables the “ticker” component. The “ticker” causes simulation tests to advance time occasionally, and since this test wants to control time explicitly it disables it. That dependency injection map can be used to override or disable any component in the system, providing the flexibility necessary for writing tests.

Using with-redefs for testing

Clojure provides the macro "with-redefs" that can redefine any function executed within the scope of that form, including on other threads. We have found this to be an invaluable feature for writing tests.

Sometimes we use with-redefs to mock specific behavior in the dependencies of what we’re testing so we can test that functionality in isolation. Other times we use it to inject failures to test fault-tolerance.

The most interesting usage of with-redefs in our codebase, and one of our most common, is using it alongside no-op functions we insert into our source code. These functions effectively provide a structured event log that can be dynamically tapped in an à la carte way depending on what a test is interested in.

Here’s one example (out of hundreds in our codebase) of how we use this pattern. One part of our system executes user-specified work in a distributed way and needs to: 1) retry the work if it fails, and 2) checkpoint its progress to a durable, replicated store after a threshold amount of work has succeeded. One of the tests for this injects a failure the first time work is attempted and then verifies the system retries the work.

The source function that executes the work is called "process-data!", and here is an excerpt from that function:

(when (and success? retry?)
  (retry-succeeded)
  (inform-of-progress! manager))

"retry-succeeded" is a no-op function defined as (defn retry-succeeded [] ).

In a totally separate function called "checkpoint-state!", the no-op function "durable-state-checkpointed" is called after it finishes replicating and writing to disk the progress information. In our test code, we have:

(deftest retry-user-work-simulated-integration-test
  (let [checkpoints     (volatile! 0)
        retry-successes (volatile! 0)]
    (with-redefs [manager/durable-state-checkpointed
                  (fn [] (vswap! checkpoints inc))

                  manager/retry-succeeded
                  (fn [] (vswap! retry-successes inc))]
      ...
      )))

Then in the body of the test, we check the correct internal events happen at the correct moments.

Best of all, since this à la carte event log approach is based on no-op functions, it adds basically no overhead when the code runs in production. We have found this approach to be an incredibly powerful testing technique that utilizes Clojure’s design in a unique way.

Macro usage

We have about 400 macros defined through our codebase, 70% of which are part of source code and 30% of which are for test code only. We have found the common advice for macros, like don’t use a macro when you can use a function, to be wise guidance. That we have 400 macros doing things you can’t do with regular functions demonstrates the extent to which we make abstractions that go far beyond what you can do with a typical language that doesn’t have a powerful macro system.

About 100 of our macros are simple "with-" style macros which open a resource at the start and ensure the resource is cleaned up when the form exits. We use these macros for things like managing file lifecycles, managing log levels, scoping configurations, and managing complex system lifecycles.
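As an illustration of the pattern, here is a sketch of a "with-" style macro over a hypothetical resource (not one of our actual macros):

(defmacro with-temp-file
  "Binds sym to a newly created temp file for the duration of body,
   deleting the file when the form exits, even on exception."
  [[sym] & body]
  `(let [~sym (java.io.File/createTempFile "scratch" ".tmp")]
     (try
       ~@body
       (finally
         (.delete ~sym)))))

;; Usage:
;; (with-temp-file [f]
;;   (spit f "hello")
;;   (slurp f))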

About 60 of our macros define abstractions of our custom language. In all of these the interpretation of the forms within is different than vanilla Clojure.

Many of our macros are utility macros, like "letlocals" which lets us more easily mix variable binding with side effects. We use it heavily in test code like so:

(letlocals
  (bind a (mk-a-thing))
  (do-something! a)
  (bind b (mk-another-thing))
  (is (= (foo b) (bar a))))

This code expands to:

(let [a (mk-a-thing)
      _ (do-something! a)
      b (mk-another-thing)]
  (is (= (foo b) (bar a))))
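A macro along these lines takes only a few lines to write; here is a minimal sketch of how letlocals could be implemented (an illustration, not our actual implementation): every form that isn't a (bind ...) becomes a throwaway binding in a single let.

(defmacro letlocals
  "Like let, but bindings are introduced with (bind sym expr) forms and bare
   expressions are evaluated for side effects in between."
  [& forms]
  (let [bindings (mapcat (fn [form]
                           (if (and (seq? form) (= 'bind (first form)))
                             [(second form) (nth form 2)]
                             ['_ form]))
                         (butlast forms))]
    `(let [~@bindings]
       ~(last forms))))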

The rest of the macros are a mix of internal abstractions, like a state machine DSL we built, and various idiosyncratic implementation details where the macro removes code duplication that can’t be removed otherwise.

Macros are a language feature that can be abused to produce terribly confusing code, or they can be leveraged to produce fantastically elegant code. Like anything else in software development, the result you end up with is determined by the skill of those using it. At Red Planet Labs we can’t imagine building software systems without macros in our toolbox.

Deterministic simulation

As we wrote about previously, we have the ability to write 100% reproducible distributed systems tests by running our whole system on a single thread and randomizing the order in which entities execute events starting from a random seed. Simulation is a major, codebase-spanning capability that heavily utilizes the aforementioned techniques of dependency injection and redefs. For example:

  • Any part of the system that in production would be a unique thread is coded in terms of executor services. To get an executor service for that particular part of the system, it requests one from an "executor service factory". In production, this returns new threads. In simulation, however, we override that component to provide executor services from our single-threaded, globally managed source.
  • Much of our system relies on time (e.g. timeouts), so time is abstracted away from our implementation. Any part of the system that is interested in time consults a “time source” dependency. In production this is the system clock, but in simulation the component is overridden with a “simulated time source” that can be explicitly controlled within our simulation tests (see the sketch after this list).
  • Promises are used quite a bit throughout the codebase to manage asynchronous, non-blocking behavior. Simulation uses with-redefs to layer additional functionality into promises that is useful for stepping through a simulation.
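As an example of the second point, here is a hypothetical sketch of the “time source” idea (names assumed): code asks a time source for the current time instead of reading the system clock directly, and simulation swaps in a clock it advances explicitly.

(defprotocol TimeSource
  (current-millis [this]))

;; Production: the real system clock.
(def system-time-source
  (reify TimeSource
    (current-millis [_] (System/currentTimeMillis))))

;; Simulation: a clock that only moves when the test advances it.
(defn simulated-time-source []
  (let [now (atom 0)]
    {:time-source (reify TimeSource
                    (current-millis [_] @now))
     :advance!    (fn [millis] (swap! now + millis))}))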

Front end

Our product provides a UI to let users see what they have running on a cluster, the current status of operations like scaling, and telemetry showing what’s going on in their applications.

The front end is a web-based single page app coded in ClojureScript. The ClojureScript ecosystem has many mature, well-designed libraries that make development efficient and fun.

Reviewing the libraries and their advantages could be a blog post in itself, but briefly: we use re-frame because its data-oriented state management and event handling models are easy to reason about and inspect. We use reitit for frontend routing; we like how its data-oriented design allows us to associate arbitrary data with each route, which in turn lets us do neat things like dispatch re-frame events on route changes. We use shadow-cljs to compile the project, in part because it dramatically simplifies the process of using JavaScript libraries and dealing with externs.

We use uPlot for displaying time-series data. Our API backend is served using a Jetty server, and we use Compojure to define backend routes.

Defining our front end in the same language as the rest of our codebase is a huge win, especially the ease of shuttling data back and forth between Clojure and ClojureScript. The immutable style emphasized by Clojure is just as beneficial in front-end code as back-end code, so being able to leverage that consistently benefits our productivity and the robustness of our product greatly.

Libraries

Here are many of the external libraries we use in our codebase, a mixture of Clojure, ClojureScript, Java, and Javascript libraries:

  • ASM: used for bytecode generation
  • Compojure: used for defining routes in web server
  • Component: used for defining subsystems with well-defined lifecycles
  • Jetty: used to serve data to our front end
  • Loom: used for representing graph data structures, especially within our compiler
  • Netty: used for asynchronous network communication
  • Nippy: used for serialization
  • Potemkin: used for a few utilities, especially "import-namespace", "import-vars", and "def-map-type"
  • reitit: used for front-end routing
  • re-frame: used to build our web code
  • RocksDB: used for some durable indexing tasks
  • Schema: used for defining types with rich schemas
  • shadow-cljs: used for compiling front-end code
  • SnakeYAML: used for parsing YAML
  • Thrift: used to help power some of the CLI of our product
  • uPlot: used to display time series graphs in our front end

Conclusion

Clojure has been fantastic for developing our product. It’s enabled us to build powerful abstractions not possible in other languages, remove all ceremony whatsoever, and utilize powerful testing techniques. Plus we’ve had multiple members on our team start with no Clojure or functional programming experience and they were able to get up to speed quickly.

If you’re interested in working with us to help define the future of software development, we’re hiring! We work on hard problems pushing what’s possible with compilers, databases, and distributed systems. Our team is fully distributed and we’re open to hiring anywhere in the world.

Where we’re going, we don’t need threads: Simulating Distributed Systems

Testing distributed systems is hard.

You probably already knew that. If you’ve spent even a little time writing distributed systems, you know that the relevant algorithms and techniques are complex and nuanced. Distributed systems end up full of gotchas and corner cases. The tests for these systems are even worse! If you actually care that your application is correct, it’s exactly the hard-to-reproduce failures that are going to make you want to throw your computer into the garbage and run off to live in the woods. 

Before you start looking for habitable caves, we have good news: there’s a way out of this problem, and it’s called deterministic simulation.

It’s all about ordering

Let’s take a step back and ask ourselves why distributed systems are so hard to test. In the standard unit testing methodology, you identify blocks of behavior (functions, classes), carefully control their inputs across the range of acceptable values, and validate their outputs. This is easy when your behavior is fully synchronous, because the only inputs you need to control are the actual data. 

But in a distributed system — which really means “any system with concurrency greater than 1” — while you do control the “inputs” in terms of arguments, you typically do not control another key variable: ordering of events. “Event” is abstract here, but for the most part, we are interested in points of interaction between concurrent actors. (Think: two threads printing to standard out, or a user trying to update the same field on their profile from multiple browser tabs.) Depending on what order each concurrent actor reads or writes some piece of data, your system can give wildly different results.

When you run a test, it runs in some order. But when you run it again, it might run in a different order, or it might not, depending on a bunch of totally invisible parameters! (What’s the background cpu load on your machine? How full is the heap in the test process? Did the JIT compiler run yet?) It’s often the case that it’s easy to get a test passing consistently in your own environment, only to have it fail when it runs in a different environment like your CI system. It might not even fail every time! Now you’re stuck trying to reproduce that failure with more adverse development conditions, repeatedly running the same test to try and catch that rare failure. You can easily lose hours or days trying to reproduce and fix these kinds of problems.

The traditional approach in this situation is to put in ad-hoc measures that exclude unexpected orderings of events: you replace implementations with mocks, or inject “extra” locking, or jump through hoops to verify the system reaches a specific state before you run your test assertions. This can be brittle, or worse, end up compromising the validity of your tests, since you’re no longer testing the code that you actually run in production. But the real problem with this approach is that it only prevents the unexpected orderings you can imagine up front! Your system can still reach unanticipated states based on orderings that you didn’t even know were possible.

No matter how many tests you write of your distributed system, if you can’t figure out some way to systematically test ordering, there are going to be vast swaths of territory that are completely uncharted. The more complex the system, the more likely it is that some totally bonkers emergent bad behavior is lurking just out of sight. 

Deterministic Simulation

Deterministic simulation is all about changing ordering from an uncontrolled variable into a controlled one. With that control, you can vary execution order directly, just like any other test input. This is an incredibly powerful technique for getting to a whole new level of rigor and repeatability in distributed systems testing, making it much easier to find bugs related to ordering and to repeat those orderings once they’ve been detected. It’s a gigantic productivity lever for your team as you build your ever-more-complex application.

This probably sounds pretty pie in the sky, and it’s true that there is no “out of the box” simulation tooling that can be bolted on to any application. But it turns out that building a simulation capability that is well-suited to your specific environment and codebase is totally achievable. Folks like FoundationDB have talked publicly about their experience following this path, and their success was a major inspiration to our own efforts. 

We’ve built up a functionally complete deterministic simulation capability at Red Planet Labs over the last few months. It’s been a transformative time for us that has already dramatically reduced the cost of finding and fixing complex ordering bugs in our distributed system. Some problems that took us weeks to resolve in the past are now getting fixed in a single afternoon! We’ve also gained the ability to test conditions that would have been difficult or impossible to create in the past. We’ll be building on top of these capabilities for a long time to come.

For us, the formula ended up being no parallelism + quantized execution + deterministic behavior = deterministic simulation. We’ll go through each of those terms below. The rest of this post will be devoted to talking through some of the interesting philosophical and design pattern discoveries we’ve made along the way, which will hopefully provide somewhere for you to start should you decide to go down this road. 

No Parallelism

If you let concurrent actors in your simulated system actually execute in parallel, you really hamstring your ability to control execution order, since there are fundamentally no guarantees about the order in which their tasks execute. To give your simulation complete control, you must explicitly remove all parallelism. One way to achieve this is to run all concurrent actors on a single thread in the same process. 

Wait — you’re trying to test distributed systems. Isn’t parallelism a fundamental characteristic of distributed systems? How can you test anything interesting if you have no parallelism? While distributed systems are indeed both concurrent and parallel, the vast majority of the problems these systems experience are a result of logical concurrency mistakes rather than actual parallel resource contention.  

When you have two co-located concurrent actors running in parallel, they can both try to access a shared resource in the same exact instant. This is called contention. Problems caused by contention are often solved by using some kind of lock, which literally reduces the allowed parallelism to 1. However, even when there is no contention, your concurrent actors can still misbehave, based solely on how their accesses interleave with each other. These problems arise from flawed concurrency logic present at a higher level of the application, and are rarely caused — or fixed — by locking. 

But when your concurrent actors don’t share the same process or hardware, they can’t access shared resources in the first place — they have to use some kind of asynchronous message passing and wait for a reply. They cannot contend. That means all bugs found in the interaction between two independent concurrent actors must therefore be of the concurrency kind, which means they only depend on order. 

A simulation scheme with no parallelism can readily explore orderings in the concurrency space, but is incapable of exploring orderings in the contention space. This is a good trade off, though, because the bulk of the problems you’re going to experience in a distributed system are of the concurrent type. 

Contention problems are still problems that have to be tested and fixed. However, it’s common for distributed system implementations to minimize direct access to shared resources and emphasize use of message-passing style even when co-located, since it makes the entire system more consistent and easier to understand. With this approach, any implementation details that can cause contention get abstracted into the “framework” layer of your application, where they can be tested exhaustively to verify correctness without having to worry about the bigger picture app-level concurrency. When this pattern is used, the no-parallelism simulation approach makes no sacrifices in its ability to explore orderings.

Quantized Execution

If you want to be able to control “what happens next” in a granular fashion, you need to change the way concurrent work streams are expressed in your application. If you define your concurrent work streams naively — as threads, for instance — then in production you get all the isolation and independence attributes you want, but you can’t introspect concurrent work streams during simulation. What the simulation really needs is for concurrent work streams to be uniquely identified and to expose their individual work items.

We call the idea of chunked work in identifiable streams quantized execution. This sounds intense, but in practice, it just means that instead of using threads, you organize everything in terms of ExecutorServices and Runnables / Callables that are submitted to them. You tie it all together with a central “executor service factory” that allows you to effectively inject the execution mode into your runtime system just like any other dependency. In production, the executor service factory returns “real” independent ExecutorServices backed by their own threads, and they will run in a totally uncoordinated manner (that is, governed only by the concurrency structures you put in place). 

But when the system is under simulation, you inject a factory that returns special “facade” ExecutorServices. They capture submitted tasks into a queue, but don’t actually execute anything until they’re explicitly told to do so. The overall simulation controller uses this interface to make the decision about who gets to execute their next work item.
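Here is a minimal sketch of the facade idea (hypothetical names; the real thing wraps full ExecutorServices and does much more): submitted work is captured in a queue, and the simulation controller decides when each item runs.

(defn facade-executor
  "Returns {:executor e, :step! f}. e captures submitted Runnables instead of
   running them; calling (f) runs the next captured task, returning true if
   there was one."
  []
  (let [queue (java.util.ArrayDeque.)]
    {:executor (reify java.util.concurrent.Executor
                 (execute [_ task]
                   (.addLast queue task)))
     :step!    (fn []
                 (if-let [^Runnable task (.pollFirst queue)]
                   (do (.run task) true)
                   false))}))

In production the factory hands back ordinary thread-backed ExecutorServices; under simulation it returns facades like this and the controller drives execution by calling step!.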

A meaningful cost of this approach is that your application can no longer be defined in terms of “naive” concurrency / parallelism constructs, since those constructs don’t work in simulations. For instance, if you try to do a blocking wait on a response from another actor, it will never complete, because you’re blocking the only thread in your system! Any part of your application that will be simulated must start to follow the cooperative multitasking model; chances are good that this will eventually spread to all parts of your application. 

Refactoring your system as a whole to work in these terms can be painful and expensive, especially while you’re getting used to it. But after you’ve gotten over the hump, you will have an equivalent system that’s better in a lot of ways. The biggest advantage of this approach is that you always run the same code, whether you’re in production or in simulation — there’s no need to vary the behavior of your components to specifically accommodate simulation. This minimizes the chances of your mocks diverging from their real-world counterparts, which is a major risk otherwise.

For Red Planet Labs, the shift to universal cooperative multitasking meant that our system as a whole is more reactive, and we don’t rely on blocking behavior except when interacting with 3rd-party libraries we don’t control. There are a lot of factors that go into it, but on the whole, our simulations tend to be meaningfully faster than their unsimulated counterparts, even when the unsimulated version gets to use multiple cpu cores! 

Deterministic Behavior

We’ve removed genuine parallelism’s unpredictability and made our actors’ work streams transparent with quantized execution. The final ingredient needed to make simulations into a productivity powerhouse is determinism.

Testing of distributed systems is often burdened with two divergent requirements: the ability to reproduce the same result consistently, and the ability to vary or explore possible orderings to produce novel failures. It turns out that it’s pretty easy to get both requirements out of simulations.

At every step of execution, our simulation must select an ExecutorService and let it complete one work item. There are a lot of ways the selection could be implemented, but the scheme that covers the broadest set of needs with the least complexity is using a seeded random number generator to choose an actor randomly. 

By making selection random, you avoid accidentally creating an ordering that is more orderly than what would happen naturally in the real world. If you run the same simulation many times, you can expect execution order to vary, iteratively exploring some fraction of all theoretically possible orderings. But by using a seeded random number generator as the source of that random selection, that variability instantly becomes repeatable. 

To rerun some “interesting” ordering, all you need is the random seed that generated that sequence of selections. When you specify that seed again directly, the properties of PRNGs kick in to give you the same sequence of selections. As long as all the other inputs to the simulation are also the same, then you will repeat the interesting ordering exactly.
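A minimal sketch of that selection loop, assuming the facade-executor style sketched earlier (names hypothetical):

(defn run-simulation!
  "Runs num-steps simulation steps, each time picking an actor at random via
   the seeded PRNG and letting it execute one captured work item. The same
   seed and inputs reproduce the same ordering."
  [seed actors num-steps]
  (let [rng (java.util.Random. seed)]
    (dotimes [_ num-steps]
      (let [actor (nth actors (.nextInt rng (count actors)))]
        ((:step! actor))))))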

It’s worth really hammering on that last point: simulations are only repeatable if all the behavior in them is deterministic. Any source of non-determinism could thwart your efforts, and it’s easy to be sabotaged by very subtle mistakes. If you ever generate a UUID, use random numbers, implicitly rely on hashmap ordering, or refer to system time, your simulation can produce different results even with the same random seed. 

Non-deterministic simulations are very frustrating, since they are often hard to detect in the first place and tedious to diagnose and fix once detected. RPL expects to do more work on our tooling in this area over time; we’ll be sure to share any interesting takeaways.

Beyond Simulation

Even just the basic capabilities of simulation are incredibly impactful when it comes to testing a distributed system. But it turns out that there are all sorts of things you can do on top of simulation that make it even more useful. A system that is compatible with simulation has gone to lengths to expose the seams between many sub-components; you can exploit those seams to create conditions in tests that would otherwise be very difficult to replicate. For instance:

  • Preventing a particular actor from being selected for execution simulates a garbage collection pause
  • Stopping all the executors associated with a logical “process” simulates process death
  • Adding delay to the execution of network IO tasks simulates network latency

And many more! RPL has only scratched the surface in this area, but we expect to exploit this avenue a ton as we seek to test our system under ever more adverse conditions.

Clojure, Faster

Dynamic, functional programming languages like Clojure are sometimes considered “slower” than their statically typed and/or object-oriented counterparts, due to facilities like dynamic function dispatch, an orientation toward immutability, and so on. But this comparison is too general.

Clojure is readily optimizable in cases where utmost performance is key. Clojure embraces the “Get it working, then make it fast” approach to building up a software system—even large-scale systems with high performance demands¹.

This post is about micro-optimizing Clojure code: how can we improve our Clojure code performance on the JVM when we need it the most, when hot code paths and tight loops prevail—for example, in large-scale data processing systems.

General JVM performance, benchmarking, and tuning is a broad and deep skillset and there are many more considerations beyond Clojure. The focus here is Clojure-specific.

Clojure Performance Tools

Before we get to our specific list of Clojure optimizations, we should emphasize that running your own benchmark and profiling experiments is essential to producing specific top-level performance outcomes for your own code and domain.

You could follow all of the advice in this post, shower your code with type hints, remove laziness and sequences, and so on, and yet, without a specific performance model for where your costs are coming from and how to measure improvement, it is possible or even likely that you will see no dip in your processing latencies or reduction in your cloud bill².

We recommend the following tools for performing your own analyses; these were all used in developing this post:

  • criterium A benchmarking tool that is a Clojure industry-standard at this point; most of the relative benchmarks in this post are generated from the REPL using criterium.
  • clj-async-profiler For minimal quick and dirty Clojure profiling. REPL-enabled.
  • no.disassemble For inspecting Clojure-compiled byte code from the REPL. (If you’re familiar with javap, the output is what you would see from decompiling classes using javap -c on the command-line, except you can see what a compiled Clojure function looks like.)
  • jmh (Java Microbenchmark Harness) A feature-rich benchmarking toolkit for the JVM. Its usage is heavily Java/annotation-based (so not REPL-friendly) but it has an expressive feature set, making it possible to run more specialized benchmarks such as ones that could deadlock (and many others; see here). We find it essential for carrying out comprehensive benchmarking analyses, and we’ll demonstrate how to use it with Clojure a bit later on.

Avoiding Laziness

Clojure’s sequence abstraction and laziness are powerful programming facilities but we must shy away from these tools when memory and compute are in high demand.

Why? Lazy implementations require generating per-element state and this can be substantial overhead when compared to non-lazy alternatives like arrays and vectors whose elements each contribute just their own value’s size to memory.

Accordingly there is computation overhead in laziness due to this per element construction; compare lazy implementations of map and filter to transform values in a vector with a non-lazy (transducer-based) alternative:

(let [data (vec (range 100))]
    (report-result
      (quick-benchmark
        (into [] (map inc (filter even? data))) {}))
;; Execution time mean : 3.916001 µs

    (report-result
      (quick-benchmark
        (into [] (comp (filter even?) (map inc)) data) {})))
;; Execution time mean : 1.738479 µs

into has first-class transducer support; the same over-2x gains follow when using transducers + transduce in favor of laziness + reduce.
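For instance, a lazy-seq pipeline feeding reduce and its transduce equivalent look like this (a sketch; exact timings will vary by machine):

(let [data (vec (range 100))]
  ;; lazy sequences materialized between each step
  (reduce + 0 (map inc (filter even? data)))
  ;; single pass, no intermediate sequences
  (transduce (comp (filter even?) (map inc)) + 0 data))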

Avoid laziness where scale matters.

first and last

This one is particularly frustrating. We might expect first and last to have special consideration of their provided type. But this is not the case; they are implemented over Clojure’s lazy sequence API without regard to type. Even with a vector, last will obtain the result using a full linear pass over the data. Both first and last also seq the input unnecessarily, which adds computation cost.

Of course we have a way out⁠—but the moral of the story is to stay well away from first and last when optimal performance is required. Here’s a full sequence of benchmarks; notice the dramatic relative cost of first and last on vectors. Pick the latencies you like the best:

(let [v (into [] (range 1000))]
    (report-result
      (quick-benchmark
        (first v) {}))
;; Execution time mean : 53.459498 ns
    (report-result
      (quick-benchmark
        (nth v 0) {}))
;; Execution time mean : 8.477276 ns
    (report-result
      (quick-benchmark
        (.nth ^Indexed v 0) {}))
;; Execution time mean : 7.898334 ns
    (report-result
      (quick-benchmark
        (last v) {}))
;; Execution time mean : 13.215272 µs  (!!!)
    (report-result
      (quick-benchmark
        (nth v (-> v count dec)) {}))
;; Execution time mean : 93.989401 ns
    ;; The cost of the previous expression is primarily in the count; so we can improve it:
    (report-result
      (quick-benchmark
        (nth v (-> ^IPersistentVector v .length dec)) {}))
    )
;; Execution time mean : 7.155083 ns

Edit: Clojure’s peek will answer the last item of a vector optimally. The benchmark for peek matches precisely the mean execution time of nth + .length in the last benchmark above. [Credit]

equiv and equals

An occasional need in high-performing applications is a read-heavy, in-memory cache or value registry, commonly with some kind of composite/tuple for its map keys. It turns out that the choice of datatype (for both map and key) is… erm… key here.

Consider the following benchmarks for a simple map lookup using different combinations of map type (Clojure’s PersistentHashMap vs. Java’s HashMap) and composite key type.

Map type            Key type                  Lookup time
PersistentHashMap   ["a" "b"]                 167 ns
PersistentHashMap   (ArrayList. ["a" "b"])    95 ns
PersistentHashMap   "a:b"                     78 ns
PersistentHashMap   (->CacheKey "a" "b")      116 ns
HashMap             ["a" "b"]                 36 ns
HashMap             (ArrayList. ["a" "b"])    53 ns
HashMap             "a:b"                     33 ns
HashMap             (->CacheKey "a" "b")      152 ns

Benchmarks for (get map key) for different map and key types. CacheKey is a basic defrecord type; "a:b" is a simple String with the composite values concatenated, etc.

This is quite a lot of variation in performance: JDK’s HashMap performs significantly better (except when using a record key—we’ll look at that one later). What accounts for this?

Clojure defines equality (i.e., clojure.core/=) more generically than Java’s .equals and this definition is used by Clojure’s data structures for equality comparison, PersistentHashMap in this case:

(let [v ["a" "b"] v2 ["a" "b"]]
    (report-result
      (quick-benchmark
        (= v v2) {}))
;; Execution time mean : 91.347726 ns
    (report-result
      (quick-benchmark
        (.equals v v2) {})))
;; Execution time mean : 10.741904 ns

Significant performance difference for these composite (vector) keys.

Of course, Clojure’s = is more general than Java .equals, which is a stricter comparison in several cases; here is just one example:

(.equals [1 2] [(int 1) 2])
⇒ false
(= [1 2] [(int 1) 2])
⇒ true

So we need to be sure that the keys we use in our map are carefully constructed to always compare correctly with the stronger .equals semantics.

Lastly, using a defrecord (CacheKey) appears to offer the worst performance. Let’s use some lightweight profiling to understand what is going on here, by comparing records vs vectors as keys:

(let [k (->CacheKey "a" "b")
      m (HashMap. {(->CacheKey "a" "b") ::val})]
    (prof/profile
      (dotimes [i 100000000] (get m k)))) ;; Figure 1
  (let [k ["a" "b"]
        m (HashMap. {["a" "b"] ::val})]
    (prof/profile
      (dotimes [i 100000000] (get m k)))) ;; Figure 2
Figure 1. Map lookup using a defrecord as the key. This flamegraph shows relative time spent in execution of each function in the call stack — generated from the REPL by clj-async-profiler.
Figure 2. Map lookup using a vector tuple as the key.

From our profiling, the map equality used by the CacheKey record looks suspicious. We can confirm its performance is much worse than vector equals:

  (let [ka (->CacheKey "a" "b") kb (->CacheKey "a" "b")]
    (quick-benchmark
      (.equals ka kb) {}))
;; Execution time mean : 2.015044 µs
  (let [ka ["a" "b"] kb ["a" "b"]]
    (quick-benchmark
      (.equals ka kb) {}))
;; Execution time mean : 11.376671 ns

The performance overhead we’ve discussed here is minimal for map lookups of singular values like keywords; we are generally okay to stick to Clojure datatypes for data entities where these key types are idiomatic. Still, using records (defrecord) for data entities can have a significant (roughly twofold) impact on keyword lookups:

  (let [k (->CacheKey "a" "b")]
    (quick-benchmark
      (:a k) {}))
;; Execution time mean :  5.638519 ns
  (let [k {:a "a" :b "b"}]
    (quick-benchmark
      (:a k) {}))
;; Execution time mean : 10.495481 ns

To make a broader generalization to wrap up this section, the JDK ecosystem at large has battled-hardened collection and data structure offerings for high performance use cases and Clojure’s Java interop (with type hints!) gives us access to any of these that make sense for our code hot spots.
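For example, dropping down to a type-hinted java.util.HashMap from Clojure is only a couple of lines (a sketch):

(let [^java.util.HashMap m (java.util.HashMap. {"a:b" ::val})]
  (.get m "a:b"))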

Dynamic Vars vs. ThreadLocal

Another occasional need is for a per-thread state mechanism. Clojure has dynamic vars and binding, but the Clojure implementation of reading a dynamic var is far less optimal than using a ThreadLocal directly:

(def ^:dynamic *dyna-state* nil)
(def ^ThreadLocal thr-local-state (ThreadLocal.))

(binding [*dyna-state* 100]
    (report-result
      (quick-benchmark
        (.get thr-local-state) {}))
;; Execution time mean : 6.269224 ns
    (report-result
      (quick-benchmark
        *dyna-state* {}))
;; Execution time mean : 42.911467 ns
    )

Even the resolution of var values can be optimized by use of ^:const metadata, which makes the var value inlined at compile-time:

(def ^:const ^Long const-val 99)
(def ^Long var-val 99)

(do
    (report-result
      (quick-benchmark
        (unchecked-multiply 1 const-val) {}))
;; Execution time mean : 2.712172 ns
    (report-result
      (quick-benchmark
        (unchecked-multiply 1 var-val) {}))
;; Execution time mean : 4.876835 ns
    )

Of course this means any redef of const-val here will go unnoticed by any previously compiled references to it.

Clojure Transients

For initializing immutable data types with many values, Clojure’s transients are fast:

(quick-benchmark
   (persistent! (reduce conj! (transient []) (range 100000))) {})
;; Execution time mean: 1.189867 ms

(quick-benchmark
   (reduce conj [] (range 100000)) {})
;; Execution time mean: 2.709462 ms

This benchmark is roughly the implementation of clojure.core/into, by the way, which uses transients for performance—so certainly use into when you can. Transients are well-covered by the documentation, but I mention them here for completeness.

Dramatic Gains with Type Hinting

As a Clojure programmer, you will inevitably encounter type hinting but what may come as a surprise is how dramatically it can improve performance.

Clojure is adamantly dynamic, which means Clojure code can compile without needing to know (or even having access to) the types of values that will flow at runtime. In the following function, x and y could take on the value of an array, map, String, or many other types that might show up dynamically:

(defn same-count? [x y]
   (= (count x) (count y)))

Using criterium, we can baseline this code without having compile-time type information for x and y:

(quick-benchmark
   (same-count? "f" "g") {})
;; Execution time mean : 71.037835 ns

Reading the source for clojure.core/count we find lots of “instance of” logic to pick and invoke the relevant behavior at runtime. This is all expensive and we can circumvent it entirely by committing to types at compile-time using type hints:

(defn same-count? [^String x ^String y]
   (= (.length x) (.length y)))
;; Execution time mean : 3ns

Note the whopping 20x improvement in mean execution time.

It should also be noted that if we omitted the ^String type hints from this last implementation of same-count? we would have created the worst of both worlds and our average execution time observed by criterium would be 40x worse than baseline (that is, 20x worse than the initial implementation based on count.)

Arrays

Type hinting Java .instanceMember accesses like above is common. But type hints are essential for performant invocation of Clojure’s array functions, as well.

For example, this:

(alength my-arr)

is 1000 times more expensive when compiled without type information. The following clocks in at 5.491867 µs mean execution time:

(let [my-arr (identity (long-array 5))]
   (report-result
      (quick-benchmark (alength my-arr) {})))

By removing the (identity) wrapper there (so that Clojure can infer the array type), or by type hinting like so:

(alength ^longs my-arr)

…the invocation will reduce to 4.12 ns. Huge.

As a final note, clean benchmarking requires us to be aware that no subtle implicit optimizations are happening where we don’t expect them. Just above we wrapped our array construction with identity to “hide” the type of the array from Clojure’s compiler.

no.disassemble can come in handy here and let us see if reflection is kicking in:

(require '[no.disassemble :as nd])
(nd/disassemble (fn [] (let [my-arr (identity (long-array 5))] (alength my-arr))))

...
38  invokestatic clojure.lang.Reflector.invokeStaticMethod(java.lang.Class, java.lang.String, java.lang.Object[]) : java.lang.Object [53]
...

Compare this to the disassembled bytecode when type-hinting or inference is used:

(nd/disassemble (fn [] (let [my-arr (long-array 5)] (alength my-arr))))

...
15  invokestatic clojure.lang.RT.alength(long[]) : int [29]
...

We can see clearly that Clojure’s RT/alength is directly invoked with the proper array type, avoiding reflection for an enormous performance gain.

The Cost of Destructuring

Clojure’s destructuring is elegant and generally performant, but we can get significant performance wins by avoiding it when it matters.

clojure.core/destructure gives us a peek at its mechanics:

(destructure '[[a b c] [1 2 3]])

[vec__3071 [1 2 3]
 a (clojure.core/nth vec__3071 0 nil)
 b (clojure.core/nth vec__3071 1 nil)
 c (clojure.core/nth vec__3071 2 nil)]

Destructuring sequential collections, as we can see, leverages nth, which evaluates a series of “reflective” conditionals (much like count, as we saw above) to determine the type of its argument in order to dispatch the correct behavior.

How does nth’s performance compare with other facilities?

Arrays

Compare nth on an array vs using aget directly:

(let [v (long-array [1 2 3 4 5])]
  (report-result
    (quick-benchmark
      (nth v 4) {}))
  ;; Execution time mean : 114.954441 ns

  (report-result
    (quick-benchmark
      (aget v 4) {})))
  ;; Execution time mean : 5.319098 ns

The aget is a 30x gain over the nth (and destructured) variants. (Again note that the argument to aget should be type-inferred/hinted or else we’ll find it to perform the worst of any variant.)

Vectors

Can destructuring Clojure’s first-class vector types be improved upon?

Consider:

(let [v [1 2 3 4 5]]
  (report-result
    (quick-benchmark
      (nth v 2) {}))
  ;; Execution time mean : 5.9 ns
  (report-result
    (quick-benchmark
      (.nth ^IPersistentVector v 2) {})))
  ;; Execution time mean : 3.8 ns

A modest, yet noticeable, 1.5x improvement.

Narrow results like this one are a perfect time to apply suspicion. Let’s attempt to cross-check our benchmark using another tool, JMH—if at least to demonstrate how to use JMH to benchmark Clojure behavior:

Benchmarking Clojure with JMH

JMH requires you to write your benchmarks as Java methods annotated with @Benchmark. To invoke Clojure we will need to rely on Java-to-Clojure interop.

In order to analyze our .nth vs nth performance from JMH, we can compile (eval) Clojure function defs from Java:

    static IFn eval_ = Clojure.var("clojure.core", "eval");
    static IFn dot_nth_fn =
      (IFn) eval_.invoke(Clojure.read(
         "(fn [^clojure.lang.Indexed x i] (.nth x i :not-found))"));
    static IFn nth_fn =
      (IFn) eval_.invoke(Clojure.read("(fn [x i] (nth x i :not-found))"));

We will also need some data to invoke our function; we’ll use a static value in this case (though for more sophisticated benchmarks we could leverage declarative JMH params to analyze how performance changes with input):

static IPersistentVector cljVector =
    (IPersistentVector) eval_.invoke(Clojure.read("(vec (range 25))"));

And now we can declare our benchmark methods:

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public Object dot_nth_vector() {
        return dot_nth_fn.invoke(cljVector, 2);
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public Object nth_vector() {
        return nth_fn.invoke(cljVector, 2);
    }

(Note: some of the boilerplate JMH setup has been omitted in these code snippets; the omitted boilerplate is well-documented here.)

Here are our results using Java HotSpot version 1.8.0_241:

Benchmark                           Mode  Cnt  Score   Error  Units
ClojureNthBenchmark.dot_nth_vector  avgt   10  4.380 ± 0.407  ns/op
ClojureNthBenchmark.nth_vector      avgt   10  6.034 ± 0.156  ns/op

It seems we’ve confirmed our 1.5x gain with JMH, almost to a tee. Still, skepticism should be the rule in benchmark-land. One advantage of JMH is that our benchmarks end up declaratively codified and can be readily recompiled and re-run with various configurations.

Not changing any code, we can recompile using Java 14, and see the above disparity presumably optimized away:

ClojureNthBenchmark.dot_nth_vector  avgt   10  3.506 ± 0.281  ns/op
ClojureNthBenchmark.nth_vector      avgt   10  3.301 ± 0.376  ns/op

Instead of spending the rest of this section running through permutations of JMH configuration, machine type, and JDK versions, we will leave this to the curious reader. The primary goal here is to stoke suspicion, especially when benchmarking results are narrow.

As a last note on being effective with JMH, we can tidy up our experience a little bit by not having to eval-compile functions defined in Java Strings from within Java code, as we did above. We can define namespaces of benchmark-able functions and load/compile them from our JMH Java class:

static final IFn require = Clojure.var("clojure.core", "require");
static {
   require.invoke(Clojure.read("com.acme.my-benchmarking-suite"));
}
IFn dot_nth_fn =
    (IFn) Clojure.var("com.acme.my-benchmarking-suite", "dot-nth");

Short-Circuiting

Like several of Clojure’s dynamic functions, in the source code of the nth function, we see a series of logical branches to select target behavior based on the provided argument type. The first of these checks in nth is Clojure’s Indexed type; by making this the first dynamic type check in its conditional reflection logic, nth short-circuits in the most common case of a Clojure collection datatype.

We should apply the same trick to our own code. When short-circuiting conditional logic in our code (say, for one example, when using cond), we benefit by placing the most commonly true conditions first. Here is a contrived example:

(let [x "2020"]
    (report-result
      (quick-benchmark
        (cond
          (instance? Long x) :long
          (instance? Integer x) :int
          (instance? Short x) :short
          (instance? Date x) :date
          (instance? String x) :string) {})))
;; Execution time mean : 6.4 ns

If String is the most commonly flowing type, we pay a big price⁠—by moving the String instance check to the top we reduce to 2.4 ns, a 2.5x improvement.
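Concretely, that reordering is just the same benchmark with the String check hoisted to the top (using the figure quoted above):

(let [x "2020"]
    (report-result
      (quick-benchmark
        (cond
          (instance? String x) :string
          (instance? Long x) :long
          (instance? Integer x) :int
          (instance? Short x) :short
          (instance? Date x) :date) {})))
;; Execution time mean : 2.4 ns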

Final Notes

Keep in mind the famous Kent Beck-attributed aphorism “Make it work, then make it fast,” or the Donald Knuth-attributed “Premature optimization is the root of all evil.” Generally Clojure makes for high developer productivity and solid performance out of the box; but for areas where optimal performance is a must, getting there from within Clojure is well within reach.


Footnotes

¹ In the end, though it is rarely necessary, one can always hand-roll Java code and invoke it from Clojure. There is always a path to full optimization if tuning the Clojure code itself is not obvious; Clojure’s Java interop is solid, and Lein and Maven are both capable of simple polyglot build configurations.

² Some of the fodder for this post came from performance tuning my open-source project thurber; one overall outcome of that effort was a 2x reduction in auto-scaling costs for the game engine Beam pipelines (used as integration tests for that project) when run on the GCP Cloud Dataflow platform. But careful focus on optimizing the specific hot code paths was key to that top-level outcome.

Serializing and Deserializing Clojure Fns with Nippy

We use Nippy at Red Planet Labs whenever we need general purpose serialization / deserialization of Clojure objects. It’s easy to use, fast, and covers virtually everything you could ever want to serialize in a Clojure program.

One notable exception is Clojure fns. There are several places in the distributed computing application we’re developing where storing or transmitting a serialized fn is really the best option. It’s worth talking about what we mean when we say serialized fn. We aren’t looking to serialize and distribute code — we already have a good mechanism for that. (It’s called a jar.) Instead, what we want to do is transmit fn instances between processes as a way to capture state and intention. Clojure already provides the primitives needed to treat a fn declared in a lexical context as an object; we just want to extend that so it works across process boundaries.

To be clear, this isn’t exactly a deficiency in Nippy. There are a lot of good reasons why you should never do this! The semantics of serializing a fn instance at one point in time and deserializing it again later, possibly in another process or even on another machine, leaves a lot of opportunities for mistakes to be made. Suffice it to say that your use cases should make it easy to ensure that fn implementations — the actual code — are always available when and where you deserialize an instance, and that they remain consistent across time. In our case, these characteristics are relatively easy to guarantee, so we can use this strategy safely.

One of the best things about Nippy is how easy it is to extend via the extend-freeze and extend-thaw APIs. Using this extension mechanism, we added transparent support for serializing and deserializing fn instances. Once you require the com.rpl.nippy-serializable-fn namespace, nippy/freeze! and nippy/thaw! will handle fn instances, either at the top level or nested within other objects, without any additional syntax:

(ns example
  (:require [taoensso.nippy :as nippy]
            [com.rpl.nippy-serializable-fn]))
 
(defn my-fn [a b c]
  (+ a b c))

(def thawed-myfn
  (nippy/thaw! (nippy/freeze! my-fn)))
(thawed-myfn 1 2 3) ; 6

(let [x 10
      afn (fn [a b c] (+ a b c x))      
      fn-bytes (nippy/freeze! afn)      
      thawed-fn (nippy/thaw! fn-bytes)]  
  (thawed-fn 1 2 3) ; 16  
  )

And that’s it! In our experiments, performance of freezing and thawing fns was comparable to freezing and thawing Clojure Records with similar amounts of “stuff” in them. You can find this extension on the Red Planet Labs GitHub, or add it as a dependency via Clojars.
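For context, nippy’s extension mechanism itself is compact. Here is roughly what extend-freeze / extend-thaw usage looks like for a simple custom type (a sketch based on nippy’s documented API, using a hypothetical record; our fn support does considerably more under the hood):

(defrecord MyType [data])

(nippy/extend-freeze MyType :my-ns/my-type
  [x data-output]
  (.writeUTF data-output (:data x)))

(nippy/extend-thaw :my-ns/my-type
  [data-input]
  (->MyType (.readUTF data-input)))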

A look under the hood

From the programmer perspective, there are (broadly) two “kinds” of fns in Clojure: declared fns (via defn) and anonymous fns (via (fn [] …)). But from an implementation perspective, what we really end up caring about is two different factors: whether the fn has anything in its closure, and whether it has metadata associated. At the highest level, to serialize a fn, we need to capture the “type” of it, anything in the closure, and any associated metadata. If we have all those things serialized in the byte stream, then we can reconstruct that fn instance at a later point in time.

Let’s take a quick side track into how Clojure actually implements fns. At compile time, Clojure generates a Java class for each fn. The most obvious part of this class is the invoke method, but there’s also a constructor, and possibly some fields (more on these later). When you interact with a fn instance in regular Clojure code, you are interacting with an instance of this generated Java class. When you invoke a Clojure fn instance, you are literally calling the invoke method with the provided arguments.

When a fn instance has nothing in its closure, then it’s really “static” in the Java sense of the term. Invocation relies only on arguments to the invoke method, not on any additional context. Whether you’re dealing with a declared fn or an anonymous fn, all you need to serialize in order to capture the essence of this kind of fn is the generated Java class name. It can be deserialized just by reading the name out of the byte stream and instantiating the fn by class name. We go a little farther here and make sure that when you deserialize a declared fn, you actually get the singleton instance stored in the named global var.

What about when you have a fn with a lexical closure? This is where the generated constructor and fields come into play. The Clojure compiler determines referenced local vars at compile time, and then adds a constructor argument and a corresponding protected field for each one. When the resulting fn is instantiated at runtime, the constructor is called with the values of those local vars, capturing them for future invocation. The invoke method implementation then refers to both the invoking arguments and the constructor arguments, providing all the context needed to complete the computation.

To serialize fns with a closure, we need to serialize the type along with all the values stored in the protected fields. Luckily, detecting fields used for storing closure values is easy and reliable, so all we need to do is get ahold of those values and recursively serialize them. The tricky part is that, naively, the only way to do this is via Java Reflection, which is a big performance no-no. The good news is that the JVM has continued to improve in this department, and now offers a much more sophisticated and high performance API for doing “reflective” operations called MethodHandles. A full discussion of MethodHandles is outside the scope of this post, but the key detail is that a MethodHandle has virtually identical performance to regular method invocation or direct field access, but can be constructed from Java Reflection primitives. We use this combination of APIs to dynamically analyze a fn instance’s closure fields and then generate an efficient serializer function, so the Java Reflection costs are only paid once per unique fn type encountered.

The other top-level case we mentioned above is fns with metadata. This is something most Clojure programmers encounter primarily through protocol fns. The implementation of these fns is different yet again — all fns with metadata are an instance of a special wrapper class “clojure.lang.AFunction$1”, which does nothing except hold a metadata map alongside the actual fn instance, and dispatches invocations down to the actual fn. When we try to freeze a fn like this, all we have to do is serialize the metadata and the unwrapped fn in sequence using Nippy and we’re done!