Everything wrong with databases and why their complexity is now unnecessary

This post is not going to be about what’s wrong with individual databases. There are so many databases, and so many individual API issues, operational problems, and arbitrary limitations, that it would take forever to cover all of them. This post is about what’s wrong with databases as a collection. It’s about what’s wrong with databases as they exist conceptually today and have existed for decades.

Of course, something can only be wrong if there’s a different, better way to do things. There is, and we’ll get to that too.

Global mutable state is harmful

Every programmer learns early on to minimize the use of state in global variables. They do have the occasional legitimate use, but as a general rule they lead to tangled code that’s difficult to reason about.

Databases are global mutable state too. They’re actually even worse than global variables, since interactions with them are frequently spread across multiple systems, which makes them even harder to reason about. They’re also durable. So if a mistake is made that corrupts the database, that corruption doesn’t get fixed by just fixing the bug. You have to manually figure out what got corrupted and repair it. In a lot of cases it’s impossible to figure out exactly what got corrupted, and you may not have enough information to repair the corruption perfectly. Your best option in these cases is to either revert to a backup or merge in partial data from a backup, neither of which is optimal.

Most programmers believe global mutable state in variables is harmful while also believing global mutable state in a database is fine, even though the two share most of the same issues.

The better approach, as we’ll get to later in this post, is event sourcing plus materialized views. There are many ways to go about applying that pattern, and it’s important to do so in a way that doesn’t create other complexities or performance regressions.

Data models are restrictive

Databases revolve around a “data model”, like “key/value”, “document”, “relational”, “column-oriented”, or “graph”. A database’s data model is how it indexes data, and a database exposes an API oriented around the kinds of queries that data model can efficiently support.

No single data model can support all use cases. This is a major reason why so many different databases exist with differing data models. So it’s common for companies to use multiple databases in order to handle their varying use cases.

There’s a better abstraction for specifying indexes, and it’s one that every programmer is already familiar with: data structures. Every data model is actually just a particular combination of data structures (sketched in code after the list below). For example:

  • Key/value: map
  • Document: map of maps
  • Relational: map of maps, with secondary indexes being additional maps
  • Column-oriented: map of sorted maps
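
To make the correspondence concrete, here’s a minimal sketch in Java with in-memory maps standing in for their durable, partitioned equivalents (the type parameters and variable names are just for illustration):

import java.util.*;

public class DataModelsAsDataStructures {
  public static void main(String[] args) {
    // Key/value model: a single map from key to value
    Map<String, byte[]> keyValue = new HashMap<>();

    // Document model: a map from document ID to a nested map of fields
    Map<String, Map<String, Object>> documents = new HashMap<>();

    // Relational model: a map from primary key to row (a map of columns),
    // with each secondary index being just another map
    Map<Long, Map<String, Object>> usersByPk = new HashMap<>();
    Map<String, Long> usersByEmail = new HashMap<>();

    // Column-oriented model: a map from row key to a sorted map of columns
    Map<String, SortedMap<String, Object>> columnFamily = new HashMap<>();

    // Composing data structures yields "data models" no database offers,
    // e.g. user ID -> sorted map of timestamp -> set of item IDs
    Map<Long, SortedMap<Long, Set<Long>>> custom = new HashMap<>();
  }
}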

Data structures, including nested data structures, can be made durable on disk and grow to huge sizes, just like data models. Read and write operations on durable data structures can be just as efficient as the corresponding operations on data models. If you can specify your indexes in terms of the simpler primitive of data structures, then your datastore can express any data model. Additionally, it can express infinitely more by composing data structures in different ways.

Because only a tiny percentage of the possible data models are available in databases (since each database implements just one particular data model), it’s incredibly common for a database to not match an application’s needs perfectly. It’s extremely expensive to build a new database from scratch, so programmers frequently twist their domain model to fit the available databases. This creates complexity at the very base of an application. If you could instead mold your datastore to fit your domain model, by specifying its “shape” (data structures) precisely, this complexity would go away.

Specifying indexes in terms of data structures rather than data models is a big part of the approach to backend development we’ll look at later in this post.

Normalization versus denormalization problem

Every programmer using relational databases eventually runs into the normalization versus denormalization problem. It’s desirable to store data as normalized as possible in order to have a clear source of truth and eliminate any possibility of inconsistency. However, storing data normalized can increase the work required to perform queries by requiring more joins. Oftentimes that extra work is so costly that you’re forced to denormalize the database to improve performance.

Storing the same information multiple times creates the possibility of inconsistency if there’s a bug of any sort in processing. Yet to meet performance constraints you’re forced to store the same information in multiple ways, whether in the same database or across multiple databases (and it’s not just RDBMSs that have this problem). So the burden is on you, the engineer, to ensure that all the code updating those databases achieves consistency in a completely fault-tolerant way, and code like that is frequently spread across many services.
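
As a toy illustration of that burden (plain Java with hypothetical names; a real system would spread these writes across services and databases), keeping a denormalized count consistent with the normalized source depends on every write path doing both updates correctly, every time:

import java.util.*;

public class DenormalizationDrift {
  // Normalized source of truth: who follows whom
  static Map<String, Set<String>> follows = new HashMap<>();
  // Denormalized copy kept for fast reads: follower counts per user
  static Map<String, Integer> followerCounts = new HashMap<>();

  static void follow(String follower, String followee) {
    follows.computeIfAbsent(followee, k -> new HashSet<>()).add(follower);
    followerCounts.merge(followee, 1, Integer::sum);
  }

  static void unfollow(String follower, String followee) {
    follows.getOrDefault(followee, new HashSet<>()).remove(follower);
    // If this second write is forgotten, or the process crashes between the
    // two writes, the count silently diverges from the source of truth.
    followerCounts.merge(followee, -1, Integer::sum);
  }
}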

There’s a fundamental tension between being a source of truth versus being an indexed store that answers queries quickly. The traditional RDBMS architecture conflates these two concepts into the same datastore.

The solution is to treat these two concepts separately. One subsystem should be used for representing the source of truth, and another should be used for materializing any number of indexed stores off of that source of truth. If that second system is capable of recomputing indexes off of that source of truth, any bugs that introduce inconsistency can be corrected.
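
Here’s a minimal single-process sketch of that separation (plain Java with illustrative names, not any particular product’s API): the log is the source of truth, the view is derived from it, and a corrupted view can always be rebuilt by replaying the log.

import java.util.*;

public class EventSourcedView {
  record FollowEvent(String follower, String followee, boolean add) {}

  // Source of truth: an append-only log of raw events
  static final List<FollowEvent> log = new ArrayList<>();

  // Materialized view: follower counts, derived entirely from the log
  static Map<String, Integer> followerCounts = new HashMap<>();

  static void append(FollowEvent e) {
    log.add(e);
    apply(followerCounts, e); // incremental update for fast reads
  }

  static void apply(Map<String, Integer> view, FollowEvent e) {
    view.merge(e.followee(), e.add() ? 1 : -1, Integer::sum);
  }

  // If a bug ever corrupts the view, rebuild it from the source of truth
  static void rebuildView() {
    Map<String, Integer> fresh = new HashMap<>();
    for (FollowEvent e : log) apply(fresh, e);
    followerCounts = fresh;
  }
}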

Once again, this is event sourcing plus materialized views. If those two systems are integrated, you don’t need to take any performance hit. More on this soon.

Restrictive schemas

Databases vary a ton regarding what kinds of values can be stored in them. Some only allow “blobs” (byte arrays), putting the burden of serializing and deserializing domain types on clients. Others allow a variety of types like integers, floating point numbers, strings, dates, and others.

It’s rare that you can store your domain representations in a first-class way in a database such that queries can reach inside your domain objects to fetch or aggregate information nested inside. Partly this is because database implementation languages are distinct from application languages, so they can’t interoperate in these ways. Sometimes you can extend a database to handle a language-neutral representation, like this extension for Postgres, but it’s cumbersome and has limitations.
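
For contrast, here’s what it looks like when an index can hold domain types first-class and queries can reach inside them (a sketch in plain Java standing in for such a datastore; Profile and Location are hypothetical domain types):

import java.util.*;

public class FirstClassDomainTypes {
  record Location(double lat, double lon) {}
  record Profile(String displayName, Location location, List<String> interests) {}

  public static void main(String[] args) {
    // An index whose values are domain objects, not blobs or flattened rows
    Map<String, Profile> profiles = new HashMap<>();
    profiles.put("alice", new Profile("Alice",
        new Location(40.7, -74.0), List.of("databases", "chess")));

    // Queries can navigate nested structure directly, with no ORM layer and
    // no manual serialization/deserialization at the boundary
    Location where = profiles.get("alice").location();
    long chessFans = profiles.values().stream()
        .filter(p -> p.interests().contains("chess"))
        .count();
    System.out.println(where + " " + chessFans);
  }
}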

It’s common to instead use adapter libraries that map a domain representation to a database representation, such as ORMs. However, such an abstraction frequently leaks and causes issues. This has been discussed extensively already, like here and here, so I don’t need to go into all the issues with ORMs again.

Being forced to index data in a way that’s different from your ideal domain representation is pure complexity. At the very least, you have to write adapter code to translate between the representations. Frequently, the limitations restrict what kinds of queries can be performed efficiently. The restrictiveness of database schemas forces you to twist your application to fit the database in undesirable ways.

This issue has been so universal for so long, it can be hard to recognize that this complexity is unnecessary. When you can mold your datastore to fit your application, including your desired domain representations, this complexity goes away.

Complex deployments

Databases do not exist in isolation. A complete backend requires using many tools: databases, processing systems, monitoring tools, schedulers, and so on. Large-scale backends oftentimes require dozens of different tools.

Updating an application can be a complex orchestration process of migrations, code updates, and infrastructure changes. It’s not uncommon for companies to have entire teams dedicated to deployment engineering.

On top of all this, to be production-ready you have to ensure everything has sufficient telemetry so you’re able to detect and diagnose any issues that may arise, whether performance or otherwise. Every tool has its own bespoke mechanisms for collecting telemetry, so getting everything gathered together into one monitoring dashboard is another non-trivial engineering task.

The complexity and cost of deployment is an artifact of the development model which currently dominates software engineering, what I call the “a la carte model”. On the surface, the a la carte model is attractive: pick the most optimal tool for each part of your architecture and make them work together.

The reality of the a la carte model doesn’t meet that ideal. As the tools are designed independently from one another, “making them work together” is oftentimes a ton of work, including the pain of building deployments. And as already discussed, the tools are usually far from optimal. Things like fixed data models and restrictive schemas mean you’re frequently twisting your application to fit your tools rather than molding your tools to fit your application.

If you take a step back and think about what we do as software engineers, the high cost of building applications doesn’t really make sense. We work in a field of engineering based on abstraction, automation, and reuse. Yet it takes hundreds or thousands of person-years to build applications that you can describe in total detail within hours – look at the sizes of the engineering teams behind pretty much every large-scale application. Even many small-scale applications require engineering effort that seems severely disproportionate to their functionality. What happened to abstraction, automation, and reuse? Why isn’t the engineering involved in building an application just what’s unique about that application?

The a la carte model exists because the software industry has operated without a cohesive model for constructing end-to-end application backends. When you use tooling that is built under a truly cohesive model, the complexities of the a la carte model melt away, the opportunity for abstraction, automation, and reuse skyrockets, and the cost of software development drastically decreases.

A cohesive model for building application backends

To go beyond databases to find a better approach to software development, you have to start from first principles. That’s the only way to break free from the shackles of decades of inertia of software architectures. So let’s clearly and rigorously define what a backend is, and then reason from there as to how backends should be structured.

The primary functions of a backend are receiving new data and answering questions on that data. Answering a question may involve fetching one piece of data that had been previously recorded (e.g. “What is Alice’s current location?”), and other questions may involve aggregations of lots of data (e.g. “What is the average bank account balance of people in Freedonia over the last three months?”). The most general way to answer a question is to literally run a function on all the data the backend has ever received:

query = function(all data)

Forget for a moment the practicalities of this, that your dataset may be 10 petabytes in size and your queries need to be answered within milliseconds. What matters is this is a starting point from which to think about backend design. Unlike the data models of databases, this clearly encapsulates all possible backends. The closer your backend design is to this ideal while meeting the necessary practical constraints (e.g. latency, scalability, consistency, fault-tolerance), the more powerful it will be. How close can you get to this ideal? In other words, what is the smallest set of tradeoffs necessary to arrive at a practical system?

It turns out all you have to do is add the concept of an index, a precomputed view of your data that enables certain queries to be resolved quickly. And so the above model becomes:

indexes = function(data)
query = function(indexes)
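
As a toy instance of this model (plain Java; a real backend would make the index durable, partitioned, and incrementally maintained), the data here is a list of page-view events, the index is a precomputed count per URL, and queries read only the index:

import java.util.*;

public class IndexedBackendModel {
  record PageView(String url, String userId) {}

  // indexes = function(data): precompute view counts per URL
  static Map<String, Long> buildIndex(List<PageView> data) {
    Map<String, Long> counts = new HashMap<>();
    for (PageView pv : data) counts.merge(pv.url(), 1L, Long::sum);
    return counts;
  }

  // query = function(indexes): answer questions from the index alone
  static long viewsFor(Map<String, Long> index, String url) {
    return index.getOrDefault(url, 0L);
  }

  public static void main(String[] args) {
    List<PageView> data = List.of(new PageView("/home", "alice"),
                                  new PageView("/home", "bob"),
                                  new PageView("/about", "alice"));
    System.out.println(viewsFor(buildIndex(data), "/home")); // prints 2
  }
}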

Every backend that’s ever been built has been an instance of this model, though not formulated explicitly like this. Usually different tools are used for the different components of this model: data, function(data), indexes, and function(indexes). In a typical RDBMS backend, an RDBMS is used for both data and indexes, with possibly other databases like Elasticsearch used for more indexing. Computation (both function(data) and function(indexes)) is usually done either as part of an API server’s handlers or in background jobs managed with queues and workers.

Larger scale backends may use NoSQL databases like Cassandra, MongoDB, or Neo4j for indexing, Kafka for incoming data, and computation systems like Hadoop, Storm, or Kafka Streams for function(data).

In all these cases backends are constructed with a hodgepodge of narrow tooling. None of these are general-purpose tools for any of the components of backends (data, function(data), indexes, function(indexes)), able to satisfy the needs of that component for all backends at all scales and all performance requirements.

What this model does is provide a framework for a next-generation tool that takes all the needs of a backend into account. If a tool could implement all these components in an integrated and general purpose way – at any scale, fault-tolerant, and with optimal performance – the complexities described earlier in this post could be avoided.

That brings us to Rama, a backend development platform designed with these first principles at its foundation.

Rama

We announced Rama on August 15th with the tagline “the 100x development platform”. Since that sounds so impossible on the face of it, we paired our announcement with a direct demonstration of that cost reduction. We re-implemented Mastodon (basically the same as the Twitter consumer product) in its entirety from scratch to be able to run at Twitter-scale. To demonstrate its scale, we operated the instance with 100M bots posting 3,500 times per second at 403 average fanout. Twitter wrote 1M lines of code and spent ~200 person-years building the equivalent (just the consumer product), and we did it with Rama in 10k lines of code and nine person-months. Our implementation is open-source, complete, high-performance, and production-ready.

Twitter’s implementation was so much more expensive because of the complexities described earlier. For instance, to reach scale they had to build multiple specialized databases from scratch (e.g. social graph database, in-memory timeline database) because there were no databases that had the right data models. They have an extremely complex deployment consisting of a large number of different tools, with over 1M lines of just Puppet configuration.

These complexities and many others are completely avoided by our Rama-based implementation. Our solutions to the performance and scalability challenges of Twitter are similar (e.g. keeping timelines in memory and reconstructing them on read if lost, and balancing processing of an unbalanced social graph), but we were able to do it by simply composing the primitives of Rama in different ways rather than building specialized infrastructure from scratch for each subproblem. The performance numbers for our Mastodon implementation are as good as or better than Twitter’s.

Rama’s programming model consists of four concepts: depots, ETLs, PStates, and queries.

These concepts correspond directly to the first principles described. What makes Rama so powerful is how it implements each piece so generally. Depots correspond to “data” and are distributed logs containing arbitrary data. “PStates” (short for “partitioned state”) correspond to indexes. You can make as many PStates as you need with each specified as an arbitrary combination of durable data structures. ETLs and queries are function(data) and function(indexes) respectively, and they’re expressed using a Turing-complete dataflow API that seamlessly distributes computation. Being Turing-complete is critical to be able to support arbitrary ETL and query logic.

How to use Rama is documented extensively on our website. That documentation contains a six-part tutorial that does a much better job introducing the API than I could possibly squeeze into this post. That tutorial explores Rama through the Java API, but Rama also has a Clojure API.

So let’s instead look at how Rama avoids the complexities that have plagued databases for so long. While Rama does everything a database does, like durably indexing data and replicating changes incrementally, it also does so much more. Rama handling both computation and storage is a big part of how it’s able to avoid these complexities.

Let’s start with the first complexity we looked at, databases being global mutable state suffering from the same issues as global mutable state in regular programs. Rama’s PStates serve the same role that a database does, but they are only writable from the ETL topology that owns them. Since every write to a PState is in the same ETL code, it’s much easier to reason about their state.

Fundamentally, PStates are materialized views over an event sourced log. So it doesn’t make sense for anything but the owning ETL topology to write to them. The combination of event sourcing and materialized views also addresses the other issue discussed earlier, that a bug deployed to production can corrupt a database in a way that’s difficult or impossible to fully correct. In Rama, a PState can always be recomputed from the depot data, which is the source of truth. This can completely correct any sort of human error.

The next complexity of databases was the restrictiveness of data models. We discussed how data structures are a much better way to specify indexes, and that each data model is just a particular combination of data structures. Being able to specify indexes in terms of data structures allows not just every existing data model to be supported, but also infinite more.

Rama’s PStates are specified as data structures. When developing Rama applications it’s common to materialize many PStates to handle all the different use cases of an application. For example, our Mastodon implementation has 33 PStates with a wide variety of data structures just for profiles, statuses, and timelines. Sometimes one PState handles 10 different use cases, and other times a PState exists just to support one use case.
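
To give a feel for the kinds of shapes involved, here are a few hypothetical PStates expressed as plain Java types (illustrating structure only; this is not Rama’s schema syntax, and these aren’t the Mastodon implementation’s actual PStates):

import java.util.*;

// The shapes of a few hypothetical PStates, written as plain Java types to
// show the structure only.
public class PStateShapes {
  // A profile PState: user ID -> map of profile fields
  Map<Long, Map<String, Object>> profiles;

  // A social-graph PState: user ID -> sorted set of follower IDs
  Map<Long, SortedSet<Long>> followers;

  // A timeline PState: user ID -> list of status IDs in reverse
  // chronological order, a shape serving exactly one use case
  Map<Long, List<Long>> timelines;
}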

PStates are durable, partitioned, and incrementally replicated. Incremental replication means there’s always another partition ready to take over if the leader partition fails, and it guarantees anything visible on a current leader will still be visible on subsequent leaders. These properties make PStates suitable for any use case handled by databases, including large scale ones.

The next complexity we covered was the normalization versus denormalization problem. Through being based on first principles, Rama inherently solves that by explicitly distinguishing between data (depots) and views (PStates).

The next complexity was the restrictive schemas of databases. One of the joys of developing with Rama is using your domain representations in every context, whether appending to a depot, reading/writing to PStates, or doing distributed processing in ETLs or queries. Any data representation is allowed, whether plain data structures like hash maps or lists, Protocol Buffers, or nested object definitions. There’s no difference between using any of them. If you want to use a type Rama doesn’t already know about, you just have to register a custom serializer.

The last complexity we discussed was complex deployments, and Rama addresses that too. Rama is an integrated platform capable of building an entire backend end-to-end. Rama applications are called “modules”, and a module contains any number of depots, ETLs, PStates, and query topologies. Rama provides built-in mechanisms to deploy, update, and scale modules, each of which is just a one-liner at the terminal. All the deployment-engineering complexity of traditional architectures comprising dozens of pieces of infrastructure evaporates because Rama is an integrated system.

Some people get the wrong impression that Rama is an “all or nothing” tool, but in reality Rama is very easy to integrate with any other system. This allows Rama to be incrementally introduced into existing architectures.

Another great consequence of Rama being such a cohesive and integrated platform is the monitoring Rama provides out of the box. Since Rama is so general purpose, it’s capable of monitoring itself: collecting monitoring data, processing it, indexing it, and visualizing it. Rama provides deep and detailed telemetry on all aspects of a module. This telemetry is invaluable for understanding the performance of a module in production, detecting and diagnosing issues, and knowing when to scale.

That covers all the complexities about databases discussed earlier, and we’re just barely scratching the surface on Rama. The best ways to learn more about Rama are to go through the documentation, play with the publicly available build of Rama, study the short, self-contained, thoroughly commented examples in rama-demo-gallery, or study our Twitter-scale Mastodon implementation.

Conclusion

The software industry has been stuck near a local maximum for a long time with the a la carte model. The current state of databases is a consequence of backend development not being approached in a holistic manner.

Rama is a paradigm shift that breaks free of this local maximum. The complexities of databases that every programmer has gotten so used to are no longer necessary. The benefits of breaking out of that local maximum are very consequential, with a dramatically lower cost of development and maintenance. The 100x cost reduction we demonstrated with our Mastodon example translates to any other large-scale application. Small to medium scale applications won’t have as extreme a cost reduction, but the reduction in complexity is significant for smaller scale applications as well.

Finally, if you’d like to use Rama in production to build new features, scale your existing systems, or simplify your infrastructure, you can apply to our private beta. We’re working closely with each private beta user not only to help them learn Rama, but also to actively help code, optimize, and test.

You can follow Red Planet Labs on Twitter here and you can follow the author of this post on Twitter here.

14 thoughts on “Everything wrong with databases and why their complexity is now unnecessary”

      1. No-one in their right mind would entrust their business to a free open source database (Postgres).

        Where’s the legal come-back when it all goes wrong? (And it will do, because there is no multi-million dollar company backing it up!).

        Oracle is the reference database all others seek to strive to be like for very good reason.

  1. Having backend developer experience, I agree with many of these claims.
    Most of the complexity I have to deal with on a large backend (or simply a service) is with state. ORMs are indeed leaky abstractions that force us to adapt the business code to fit the DB’s needs. Data structures are not directly stored and need to be adapted. Data queries require denormalization or materialized views to be efficient. Schema migration is painful and requires some downtime…
    So, event sourcing + CQRS to the rescue seems to be a good idea. But it takes time to put the technologies together and integrate them before creating your real application. Providing a framework that takes that burden away seems like a very good idea!
    I’d like to test Rama but my app is in Python and Go. Do you plan to provide bindings?

  2. Firstly I must say, everything was really well put, and I have been through everything you explained here, so kudos on the article. The idea of even having a single platform that sort of takes care of all these issues as a whole is very interesting.
    Are there any future plans to support more languages such as C#?

    1. > The idea of even having a single platform that sort of takes care of all these issues as a whole is very interesting.

      Yeah, use Oracle and the problem goes away. These are non-issues with Oracle.
      Like I say, they invent these problems & then try to sell you the solution.

      Just use the one that’s been around since the 70s. Initially developed for use by the CIA. Used by thousands upon thousands of companies worldwide today.

      Tried & tested. Just use Oracle !

      (PS. Nobody uses Python or C# as a DB language & Oracle has its own built-in language which is integrated tightly with the DB for max. performance that you will not — nay, cannot see — with any other language).

  3. Congrats on launching! It’s awesome to see how far you’ve come since our coffee chat in 2018!

    I’ve long believed in the event sourcing + materialized view worldview. I was a big fan of Datomic. Also Eve had elements of this vision. I would love to see it brought to my stack. It’s not clear to me to what extent Rama *is* a database or a system that includes a piece that replaces what I’d think of as a database. Personally, I’d like it to conceptually be a database, a la Datomic vs a whole system like Eve.

    I want to mostly keep doing what I’m doing, but instead of dealing with deploying and managing postgres, schema migrations, and ORMs, deal with something that’s more functional, like you describe. And of course I’d love it to be fully in JavaScript: the event emitting, index logic, reducer/materializing logic, and querying. I can understand you choosing the jvm for enterprise reasons though.

    In my experience, going off and using a newfangled and holistic database is a risky proposition. Databases like Postgres are Swiss Army knives that solve a ridiculous number of problems because they’ve accreted decades of grey-bearded wisdom. I’m curious how you’re planning to solve the lock-in problem.

    1. You can think of Rama as a system that includes a piece that replaces what you currently use databases for.

      The “lock-in problem” is just a perception thing. Rama is actually easier to migrate onto or off due to integrating computation and storage. That means it’s comparatively much easier to apply things like the Strangler Fig pattern.

    1. You just need to add it as a Maven dependency according to the instructions here: https://redplanetlabs.com/docs/~/downloads-maven-local-dev.html

      From there, you can try Rama using InProcessCluster, which simulates a full Rama cluster in process and has identical functionality to a real cluster.

      rama-demo-gallery is an open-source project you can clone which has a number of modules and corresponding tests utilizing InProcessCluster. https://github.com/redplanetlabs/rama-demo-gallery
