Clojure, Faster

Dynamic, functional programming languages like Clojure are sometimes considered “slower” than their statically-typed and/or OO counterparts, due to facilities like dynamic function dispatch and an orientation toward immutable data. But this comparison is too general.

Clojure is readily optimizable in cases where utmost performance is key. Clojure embraces the “Get it working, then make it fast” approach to building up a software system—even large-scale systems with high performance demands¹.

This post is about micro-optimizing Clojure code: how can we improve our Clojure code performance on the JVM when we need it the most, when hot code paths and tight loops prevail—for example, in large-scale data processing systems.

General JVM performance, benchmarking, and tuning is a broad and deep skillset, and many considerations lie beyond Clojure itself. The focus here is Clojure-specific.

Clojure Performance Tools

Before we get to our specific list of Clojure optimizations, we should emphasize that running your own benchmark and profiling experiments is essential to producing specific top-level performance outcomes for your own code and domain.

You could follow all of the advice in this post: shower your code with type hints, remove laziness and sequences, and so on. But without a specific performance model for where your costs are coming from, and how to measure improvement, it is possible, even likely, that you will see no dip in your processing latencies or reduction in your cloud bill².

We recommend the following tools for performing your own analyses; these were all used in developing this post:

  • criterium A benchmarking tool that is a Clojure industry-standard at this point; most of the relative benchmarks in this post are generated from the REPL using criterium.
  • clj-async-profiler For quick-and-dirty Clojure profiling with minimal setup. REPL-enabled.
  • no.disassemble For inspecting Clojure-compiled byte code from the REPL. (If you’re familiar with javap, the output is what you would see from decompiling classes using javap -c on the command-line, except you can see what a compiled Clojure function looks like.)
  • jmh (Java Microbenchmark Harness) A feature-rich benchmarking toolkit for the JVM. Its usage is heavily Java/annotation-based (so not REPL-friendly) but it has an expressive feature set, making it possible to run more specialized benchmarks such as ones that could deadlock (and many others; see here). We find it essential in carrying out a comprehensive benchmarking analysis, and we’ll demonstrate how to use it with Clojure a bit later on.

Avoiding Laziness

Clojure’s sequence abstraction and laziness are powerful programming facilities but we must shy away from these tools when memory and compute are in high demand.

Why? Lazy implementations require generating per-element state, which can be substantial overhead compared to non-lazy alternatives like arrays and vectors, whose elements each contribute just their own value’s size to memory.

Laziness also carries computation overhead due to this per-element construction; compare lazy implementations of map and filter transforming values in a vector against a non-lazy (transducer-based) alternative:

(require '[criterium.core :refer [report-result quick-benchmark]])

(let [data (vec (range 100))]
    (report-result
      (quick-benchmark
        (into [] (map inc (filter even? data))) {}))
;; Execution time mean : 3.916001 µs

    (report-result
      (quick-benchmark
        (into [] (comp (filter even?) (map inc)) data) {})))
;; Execution time mean : 1.738479 µs

into has first-class transducer support; the same over-2x gains follow when using transducers + transduce in favor of laziness + reduce.
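
To make the comparison concrete, here is a minimal sketch of the transduce variant alongside its lazy counterpart (illustrative only; timings will vary by machine):

(let [data (vec (range 100))]
  ;; lazy: map and filter allocate an intermediate sequence step per element
  (reduce + 0 (map inc (filter even? data)))
  ;; non-lazy: one fused pass over data, no intermediate sequences
  (transduce (comp (filter even?) (map inc)) + 0 data))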

Avoid laziness where scale matters.

first and last

This one is particularly frustrating. We might expect first and last to have special consideration of their provided type. But this is not the case; they are implemented over Clojure’s lazy sequence API without regard to type. Even with a vector, last obtains its result using a full linear pass over the data. Both first and last also seq their input unnecessarily, which adds computation cost.

Of course we have a way out—but the moral of the story is to stay well away from first and last when optimal performance is required. Here’s a full sequence of benchmarks; notice the dramatic relative cost of first and last on vectors. Pick the latencies you like the best:

(import '(clojure.lang Indexed IPersistentVector))

(let [v (into [] (range 1000))]
    (report-result
      (quick-benchmark
        (first v) {}))
;; Execution time mean : 53.459498 ns
    (report-result
      (quick-benchmark
        (nth v 0) {}))
;; Execution time mean : 8.477276 ns
    (report-result
      (quick-benchmark
        (.nth ^Indexed v 0) {}))
;; Execution time mean : 7.898334 ns
    (report-result
      (quick-benchmark
        (last v) {}))
;; Execution time mean : 13.215272 µs  (!!!)
    (report-result
      (quick-benchmark
        (nth v (-> v count dec)) {}))
;; Execution time mean : 93.989401 ns
    ;; The cost of the previous expression is primarily in the count; so we can improve it:
    (report-result
      (quick-benchmark
        (nth v (-> ^IPersistentVector v .length dec)) {})))
;; Execution time mean : 7.155083 ns

Edit: Clojure’s peek returns the last item of a vector optimally. A benchmark of peek matches precisely the mean execution time of the nth + .length combination in the last benchmark above. [Credit]
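
For completeness, a sketch of that peek benchmark (same setup as the vector benchmarks above):

(let [v (into [] (range 1000))]
  (report-result
    (quick-benchmark
      (peek v) {})))
;; Execution time mean : ~7 ns, matching the nth + .length variant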

equiv and equals

An occasional need in high-performing applications is a read-heavy, in-memory cache or value registry, commonly with some kind of composite/tuple for its map keys. It turns out that the choice of datatype (for both map and key) is… erm… key here.

Consider the following benchmarks for a simple map lookup using different combinations of map type (Clojure’s PersistentHashMap vs. Java’s HashMap) and composite key type.

Map type           Key type                  Mean lookup time
PersistentHashMap  ["a" "b"]                 167 ns
PersistentHashMap  (ArrayList. ["a" "b"])     95 ns
PersistentHashMap  "a:b"                      78 ns
PersistentHashMap  (->CacheKey "a" "b")      116 ns
HashMap            ["a" "b"]                  36 ns
HashMap            (ArrayList. ["a" "b"])     53 ns
HashMap            "a:b"                      33 ns
HashMap            (->CacheKey "a" "b")      152 ns
Benchmarks for (get map key) for different map and key types. CacheKey is a basic defrecord type; “a:b” is a simple String with the composite values concatenated, etc.

This is quite a lot of variation in performance: JDK’s HashMap performs significantly better (except when using a record key—we’ll look at that one later). What accounts for this?

Clojure defines equality (i.e., clojure.core/=) more generically than Java’s .equals and this definition is used by Clojure’s data structures for equality comparison, PersistentHashMap in this case:

(let [v ["a" "b"] v2 ["a" "b"]]
    (report-result
      (quick-benchmark
        (= v v2) {}))
;; Execution time mean : 91.347726 ns
    (report-result
      (quick-benchmark
        (.equals v v2) {})))
;; Execution time mean : 10.741904 ns

Significant performance difference for these composite (vector) keys.

Of course, Clojure’s = is more general than Java .equals, which is a stricter comparison in several cases; here is just one example:

(.equals [1 2] [(int 1) 2])
;; => false
(= [1 2] [(int 1) 2])
;; => true

So we need to be sure that the keys we use in our map are carefully constructed to always compare correctly with the stronger .equals semantics.
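
One way to do that (a sketch; make-key is a hypothetical helper, not from the benchmarks above) is to normalize key components to canonical types at construction time, so .equals never sees mixed boxed types:

;; Hypothetical helper: coerce numeric components so that keys built
;; on different code paths still satisfy the stricter .equals.
(defn make-key [a b]
  [(long a) (long b)])

(.equals (make-key 1 2) (make-key (int 1) 2))
;; => true (both components are now boxed Longs)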

Lastly, using a defrecord (CacheKey) appears to offer the worst performance. Let’s use some lightweight profiling to understand what is going on here, by comparing records vs vectors as keys:

(import 'java.util.HashMap)
(require '[clj-async-profiler.core :as prof])

(defrecord CacheKey [a b])  ;; a basic record type, per the table above

(let [k (->CacheKey "a" "b")
      m (HashMap. {(->CacheKey "a" "b") ::val})]
  (prof/profile
    (dotimes [i 100000000] (get m k))))  ;; Figure 1

(let [k ["a" "b"]
      m (HashMap. {["a" "b"] ::val})]
  (prof/profile
    (dotimes [i 100000000] (get m k))))  ;; Figure 2
Figure 1. Map lookup using a defrecord as the key. This flamegraph shows relative time spent in execution of each function in the call stack — generated from the REPL by clj-async-profiler.
Figure 2. Map lookup using a vector tuple as the key.

From our profiling, the map equality used by the CacheKey record looks suspicious. We can confirm its performance is much worse than vector equals:

(let [ka (->CacheKey "a" "b") kb (->CacheKey "a" "b")]
  (quick-benchmark
    (.equals ka kb) {}))
;; Execution time mean : 2.015044 µs

(let [ka ["a" "b"] kb ["a" "b"]]
  (quick-benchmark
    (.equals ka kb) {}))
;; Execution time mean : 11.376671 ns

The performance overhead we’ve discussed here is minimal for map lookups of singular values like keywords; we are generally okay to stick to Clojure datatypes for data entities where these key types are idiomatic. In fact, using records (defrecord) for data entities can have a significant, roughly 2x, positive impact on keyword lookups:

(let [k (->CacheKey "a" "b")]
  (quick-benchmark
    (:a k) {}))
;; Execution time mean : 5.638519 ns

(let [k {:a "a" :b "b"}]
  (quick-benchmark
    (:a k) {}))
;; Execution time mean : 10.495481 ns

To wrap up this section with a broader generalization: the JDK ecosystem at large has battle-hardened collection and data structure offerings for high-performance use cases, and Clojure’s Java interop (with type hints!) gives us access to any of these that make sense for our hot code paths.
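
As a hedged sketch of that interop (reusing the plain-String key shape from the table above):

(import 'java.util.HashMap)

(let [^HashMap m (HashMap. {"a:b" ::val})]
  ;; .get dispatches directly on the hinted type: no reflection, and
  ;; HashMap compares keys with .equals rather than Clojure's = machinery
  (.get m "a:b"))
;; => ::val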

Dynamic Vars vs. ThreadLocal

Another occasional need is for a per-thread state mechanism. Clojure has dynamic vars and binding, but the Clojure implementation of reading a dynamic var is far less optimal than using a ThreadLocal directly:

(def ^:dynamic *dyna-state* nil)
(def ^ThreadLocal thr-local-state (ThreadLocal.))

(binding [*dyna-state* 100]
    (report-result
      (quick-benchmark
        (.get thr-local-state) {}))
;; Execution time mean : 6.269224 ns
    (report-result
      (quick-benchmark
        *dyna-state* {}))
;; Execution time mean : 42.911467 ns
    )
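
If the thread-local needs a default value, ThreadLocal/withInitial (Java 8+) can supply one; a small sketch (the var name is illustrative):

(def ^ThreadLocal thr-local-default
  (ThreadLocal/withInitial
    (reify java.util.function.Supplier
      (get [_] 100))))  ;; each thread's initial value

(.get thr-local-default)
;; => 100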

Even the resolution of var values can be optimized with ^:const metadata, which inlines the var’s value at compile time:

(def ^:const ^Long const-val 99)
(def ^Long var-val 99)

(do
    (report-result
      (quick-benchmark
        (unchecked-multiply 1 const-val) {}))
;; Execution time mean : 2.712172 ns
    (report-result
      (quick-benchmark
        (unchecked-multiply 1 var-val) {}))
;; Execution time mean : 4.876835 ns
    )

Of course this means any redef of const-val here will go unnoticed by any previously compiled references to it.

Clojure Transients

For initializing immutable data types with many values, Clojure’s transients are fast:

(quick-benchmark
   (persistent! (reduce conj! (transient []) (range 100000))) {})
;; Execution time mean : 1.189867 ms

(quick-benchmark
   (reduce conj [] (range 100000)) {})
;; Execution time mean : 2.709462 ms

This benchmark is roughly the implementation of clojure.core/into, by the way, which uses transients for performance—so certainly use into when you can. Transients are well-covered by the documentation, but I mention them here for completeness.

Dramatic Gains with Type Hinting

As a Clojure programmer, you will inevitably encounter type hinting, but what may come as a surprise is how dramatically it can improve performance.

Clojure is adamantly dynamic, which means Clojure code can compile without needing to know (or even having access to) the types of values that will flow at runtime. In the following function, x and y could take on the value of an array, map, String, or many other types that might show up dynamically:

(defn same-count? [x y]
   (= (count x) (count y)))

Using criterium, we can baseline this code without having compile-time type information for x and y:

(quick-benchmark
   (same-count? "f" "g") {})
;; Execution time mean : 71.037835 ns

Reading the source for clojure.core/count we find lots of “instance of” logic to pick and invoke the relevant behavior at runtime. This is all expensive and we can circumvent it entirely by committing to types at compile-time using type hints:

(defn same-count? [^String x ^String y]
   (= (.length x) (.length y)))
;; Execution time mean : 3 ns

Note the whopping 20x improvement in mean execution time.

It should also be noted that if we omitted the ^String type hints from this last implementation of same-count?, we would have created the worst of both worlds: each .length call would fall back to reflection, and the average execution time observed by criterium would be 40x worse than baseline (that is, than the initial implementation based on count).
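
The compiler can warn us about this before criterium does; enabling reflection warnings flags the un-hinted interop call:

(set! *warn-on-reflection* true)

(defn same-count? [x y]
  (= (.length x) (.length y)))
;; Reflection warning ... - reference to field length can't be resolved.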

Arrays

Type hinting Java .instanceMember accesses like above is common. But type hints are essential for performant invocation of Clojure’s array functions, as well.

For example, this:

(alength my-arr)

is 1000 times more expensive when compiled without type information. The following clocks in at 5.491867 µs mean execution time:

(let [my-arr (identity (long-array 5))]
   (report-result
      (quick-benchmark (alength my-arr) {})))

By removing the (identity) wrapper there (so that Clojure can infer the array type), or by type hinting like so:

(alength ^longs my-arr)

…the invocation will reduce to 4.12 ns. Huge.

As a final note, clean benchmarking requires us to be aware that no subtle implicit optimizations are happening where we don’t expect them. Just above we wrapped our array construction with identity to “hide” the type of the array from Clojure’s compiler.

no.disassemble can come in handy here and let us see if reflection is kicking in:

(require '[no.disassemble :as nd])
(nd/disassemble (fn [] (let [my-arr (identity (long-array 5))] (alength my-arr))))

...
38  invokestatic clojure.lang.Reflector.invokeStaticMethod(java.lang.Class, java.lang.String, java.lang.Object[]) : java.lang.Object [53]
...

Compare this to the disassembled bytecode when type-hinting or inference is used:

(nd/disassemble (fn [] (let [my-arr (long-array 5)] (alength my-arr))))

...
15  invokestatic clojure.lang.RT.alength(long[]) : int [29]
...

We can see clearly that clojure.lang.RT/alength is invoked directly with the proper array type, avoiding reflection and yielding the enormous performance gain.

The Cost of Destructuring

Clojure’s destructuring is elegant and generally performant, but we can get significant performance wins by avoiding it when it matters.

clojure.core/destructure gives us a peek at its mechanics:

(destructure '[[a b c] [1 2 3]])

[vec__3071 [1 2 3]
 a (clojure.core/nth vec__3071 0 nil)
 b (clojure.core/nth vec__3071 1 nil)
 c (clojure.core/nth vec__3071 2 nil)]

Destructuring sequential collections, as we can see, leverages nth, which evaluates a series of “reflective” conditionals (much like count, as we saw above) to determine the type of its argument and dispatch the correct behavior.
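
In hot paths we can trade that destructuring for explicit, hinted access; a sketch (sum-abc is a made-up example):

;; destructured: each of a, b, c is bound via clojure.core/nth
(defn sum-abc [[a b c]]
  (+ a b c))

;; hand-rolled: direct Indexed access skips nth's type checks
(defn sum-abc-fast [^clojure.lang.Indexed v]
  (+ (.nth v 0) (.nth v 1) (.nth v 2)))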

How does nth’s performance compare with other facilities?

Arrays

Compare nth on an array vs using aget directly:

(let [v (long-array [1 2 3 4 5])]
  (report-result
    (quick-benchmark
      (nth v 4) {}))
  ;; Execution time mean : 114.954441 ns

  (report-result
    (quick-benchmark
      (aget v 4) {})))
;; Execution time mean : 5.319098 ns

The aget is an over 20x gain over the nth (and destructured) variants. (Again note that the argument to aget should be type-inferred/hinted or else we’ll find it to perform the worst of any variant.)
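
In a tight loop these costs compound; a minimal sketch of an array sum where the hinted binding lets every access compile to a direct aget:

(let [^longs arr (long-array (range 1000))]
  ;; areduce expands to a primitive loop over aget: no boxing,
  ;; no reflective dispatch
  (areduce arr i sum 0 (+ sum (aget arr i))))
;; => 499500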

Vectors

Can destructuring Clojure’s first-class vector types be improved upon?

Consider:

(let [v [1 2 3 4 5]]
  (report-result
    (quick-benchmark
      (nth v 2) {}))
;; Execution time mean : 5.9 ns
  (report-result
    (quick-benchmark
      (.nth ^IPersistentVector v 2) {})))
;; Execution time mean : 3.8 ns

A modest, yet noticeable, 1.5x improvement.

Narrow results like this one are a perfect time to apply suspicion. Let’s attempt to cross-check our benchmark using another tool, JMH—if only to demonstrate how to use JMH to benchmark Clojure behavior:

Benchmarking Clojure with JMH

JMH requires you to write your benchmarks as Java methods annotated with @Benchmark. To invoke Clojure we will need to rely on Java-to-Clojure interop.

In order to analyze our .nth vs nth performance from JMH, we can compile (eval) Clojure function defs from Java:

    static IFn eval_ = Clojure.var("clojure.core", "eval");
    static IFn dot_nth_fn =
      (IFn) eval_.invoke(Clojure.read(
         "(fn [^clojure.lang.Indexed x i] (.nth x i :not-found))"));
    static IFn nth_fn =
      (IFn) eval_.invoke(Clojure.read("(fn [x i] (nth x i :not-found))"));

We will also need some data to invoke our function; we’ll use a static value in this case (though for more sophisticated benchmarks we could leverage declarative JMH params to analyze how performance changes with input):

static IPersistentVector cljVector =
    (IPersistentVector) eval_.invoke(Clojure.read("(vec (range 25))"));

And now we can declare our benchmark methods:

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public Object dot_nth_vector() {
        return dot_nth_fn.invoke(cljVector, 2);
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public Object nth_vector() {
        return nth_fn.invoke(cljVector, 2);
    }

(Note: some of the boilerplate JMH setup has been omitted from these code snippets; the omitted boilerplate is well-documented here.)

Here are our results using Java HotSpot version 1.8.0_241:

Benchmark                           Mode  Cnt  Score   Error  Units
ClojureNthBenchmark.dot_nth_vector  avgt   10  4.380 ± 0.407  ns/op
ClojureNthBenchmark.nth_vector      avgt   10  6.034 ± 0.156  ns/op

It seems we’ve confirmed our 1.5x gain with JMH, almost to a tee. Still, skepticism should be the rule in benchmark-land. One advantage of JMH is that our benchmarks end up declaratively codified and can be readily recompiled and re-run with various configurations.

Without changing any code, we can recompile using Java 14 and see the above disparity presumably optimized away:

ClojureNthBenchmark.dot_nth_vector  avgt   10  3.506 ± 0.281  ns/op
ClojureNthBenchmark.nth_vector      avgt   10  3.301 ± 0.376  ns/op

Instead of spending the rest of this section running through permutations of JMH configuration, machine type, and JDK versions, we will leave this to the curious reader. The primary goal here is to stoke suspicion, especially when benchmarking results are narrow.

As a last note on being effective with JMH, we can tidy up our experience a little bit by not having to eval-compile functions defined in Java Strings from within Java code, as we did above. We can define namespaces of benchmark-able functions and load/compile them from our JMH Java class:

static final IFn require = Clojure.var("clojure.core", "require");
static {
   require.invoke(Clojure.read("com.acme.my-benchmarking-suite"));
}
IFn dot_nth_fn =
    (IFn) Clojure.var("com.acme.my-benchmarking-suite", "dot-nth");
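
The Clojure side might then look like the following sketch (the namespace and function names mirror the Java snippet above):

(ns com.acme.my-benchmarking-suite
  (:import (clojure.lang Indexed)))

(defn dot-nth [^Indexed x i]
  (.nth x i :not-found))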

Short-Circuiting

In the source code of the nth function, as in several of Clojure’s dynamic functions, we see a series of logical branches that select target behavior based on the provided argument type. The first of these checks in nth is for Clojure’s Indexed type; by making this the first dynamic type check in its conditional reflection logic, nth short-circuits in the most common case: a Clojure collection datatype.

We should apply the same trick to our own code. When short-circuiting on conditional logic (say, when using cond), we benefit by placing the most commonly true conditions first. Here is a contrived example:

(let [x "2020"]
    (report-result
      (quick-benchmark
        (cond
          (instance? Long x) :long
          (instance? Integer x) :int
          (instance? Short x) :short
          (instance? Date x) :date
          (instance? String x) :string) {})))
;; Execution time mean : 6.4 ns

If String is the most commonly flowing type, we pay a big price: moving the String instance check to the top reduces the mean to 2.4 ns, a 2.5x improvement (see the reordered version below).
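
The reordered cond, for comparison:

(cond
  (instance? String x) :string  ;; most common case first
  (instance? Long x) :long
  (instance? Integer x) :int
  (instance? Short x) :short
  (instance? Date x) :date)
;; Execution time mean : 2.4 ns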

Final Notes

Keep in mind the famous Kent Beck-attributed aphorism “Make it work, then make it fast,” or the Donald Knuth-attributed “Premature optimization is the root of all evil.” Generally Clojure makes for high developer productivity and solid performance out of the box; but for areas where optimal performance is a must, getting there from within Clojure is well within reach.


Footnotes

¹ In the end, though rarely necessary, one can always hand-roll Java code and invoke it from Clojure. There is always a path to full optimization if tuning the Clojure code itself is not fruitful; Clojure’s Java interop is solid, and Lein and Maven are both capable of simple polyglot build configurations.

² Some of the fodder for this post came from performance tuning my open-source project thurber; one overall outcome of that effort was a 2x reduction in auto-scaling costs for the game engine Beam pipelines (used as integration tests for that project) when run on the GCP Cloud Dataflow platform. But careful focus on optimizing the specific hot code paths was key to that top-level outcome.

5 thoughts on “Clojure, Faster”

  1. Seems like Specter is worth a mention here — I believe this takes care of perf. of first/last, length, etc., as well as a host of more advanced cases, no? This way Specter will do the most efficient thing for type and you don’t need to worry as much.

    e.g.: (select-first LAST v)

    1. Yes, thanks, specter is worth the mention, and is highly optimized. However, while the FIRST and LAST navigators do the optimal underlying operation (i.e., nth 0 and peek, respectively), there is nontrivial Specter overhead if you *only* require the first or last elem of a vector; consider:

      (nth v 0)               avgt    8      8.366 ±   0.488  ns/op
      (select-first FIRST v)  avgt    8     46.661 ±   2.316  ns/op
      (peek v)                avgt    8      7.478 ±   0.326  ns/op
      (select-first LAST v)   avgt    8     43.516 ±   3.760  ns/op