Datoms All the Way Down

Evan Phoenix & Paul Hinze

Every platform has to put its state somewhere, and a deployment system has a lot of it: apps, the versions of those apps, the config attached to a version, the routes pointing at it, the addons it depends on, the cluster it runs on, and the web of relationships between them all. One of the first questions you answer when you build something like Miren is where all of that should live.

The reflex answer is Postgres. There’s a whole genre of (often correct!) advice that says just use Postgres for everything, and our prototype happily did exactly that. It worked. But as we moved from “does this work” toward “people are going to run this themselves,” Postgres started to look like the wrong fit for us.

What the prototype taught us

Start with what we’re storing. The data is a sprawl of small, related things, dozens of distinct ones today, and the list grows every time we ship a feature. Almost every change to it is additive: a version grows an optional field, a new type appears, an existing one learns about a new relationship. We rarely delete a column or rewrite what existing data means; we’re nearly always bolting one more thing onto the side. (Not never. Sometimes we get a shape wrong and have to rearrange some data, and nothing that follows makes that free. But that’s the exception, and the exception isn’t what you optimize for.)

A relational database makes you pay for every one of those additions, and the cost is almost all ceremony. The ALTER TABLE ADD COLUMN itself is cheap; on modern Postgres it’s usually an instant metadata change. What’s expensive is everything around it: a migration file, a forward path, usually a backward one, and a deploy sequenced so the code and schema agree at each step. And that’s for a single database. We run a fleet, every customer on their own copy, so one schema change becomes that same migration smeared across hundreds of clusters we don’t operate, each on its own version, some mid-upgrade. Migrations don’t fan out gracefully, and we’d be signing up to do that fan-out forever.

The second cost is heavier. Whatever the database is, our users have to run it, and we’re aiming at a lose-any-node setup, which means running it highly available. Standing up HA Postgres on someone else’s hardware, backing it up, keeping it alive at 3am: that’s real work, and pushing it onto our users was the thing we most wanted to avoid. A few of us lived that at HashiCorp for years. The scars were enough to make us stop and actually ask what we needed before reaching for Postgres again.

When we actually asked, the answer was that we needed almost none of what a relational database is for. Nobody runs reports or ad-hoc joins against this data. Our data is relational in shape, but its job is to drive a reactive control plane: hold the desired state, notice when it changes, and reconcile the world to match. A full relational query engine was far more than the job called for.

Enter the datom

So we took a step way back, all the way to a question that sounds like navel-gazing, but bear with us: what, at base, are we actually storing? Small facts. “This thing is named web-server.” “This thing listens on port 8080.” A row, a JSON object, a struct: each is really just a bundle of those facts under a single name.

Datomic, the database we borrowed our model from, is built on this and almost nothing else: it begins with the single fact and adds every other capability as a deliberate layer on top. The fact itself is a datom; a bag of datoms is an entity. In generic data-modeling terms, you’ll see this referred to as EAV, entity-attribute-value, because the irreducible fact is “this entity has this attribute set to this value.”

We start at the very bottom, with as little structure as the idea will allow. Forget any particular database, and forget our code; here is an entity, written as nothing but its facts:

app/name     "web-server"
app/port     8080
app/enabled  true

That is the ground floor, and it is deliberately almost nothing: no required fields, not even an identity. What ties these three facts together is just our choice to look at them as one. Where a relational row is a rigid tuple (every column the table defines, or you alter the table), this is the opposite: a bag that holds exactly the facts it holds, and grows another the instant you add one. (The quotes are only notation: web-server is a string, 8080 an int, true a bool. Those primitives, plus a few more, are the whole base type system; everything richer gets built up from them, layer by layer.)
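To make the "bag of facts" idea concrete, here is a minimal Go sketch. It is not Miren's code: the `Fact` and `Entity` types and the `app/team` attribute are assumptions made up for illustration.

```go
package main

import "fmt"

// Fact is one attribute-value pair: the smallest thing the model stores.
type Fact struct {
	Attr  string
	Value any // string, int64, bool, ... (the base primitives)
}

// Entity is just the facts we choose to look at together.
type Entity []Fact

// Get returns every value recorded for attr; a missing attribute is simply empty.
func (e Entity) Get(attr string) []any {
	var out []any
	for _, f := range e {
		if f.Attr == attr {
			out = append(out, f.Value)
		}
	}
	return out
}

func main() {
	app := Entity{
		{"app/name", "web-server"},
		{"app/port", int64(8080)},
		{"app/enabled", true},
	}
	// Growing the bag is appending one more fact; nothing else changes.
	// "app/team" is a made-up attribute for illustration.
	app = append(app, Fact{"app/team", "platform"})

	fmt.Println(app.Get("app/port"))
	fmt.Println(len(app.Get("app/region"))) // never set: no values, no error
}
```

Nothing declares which attributes an entity may carry, so asking for one that was never set returns an empty result rather than an error.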

If that makes the whole thing sound like a document store, that’s fair, since it has the same open, unstructured shape underneath. But an entirely schema-less store is just a junk drawer where anything can hold anything. The rest of our design is about layering that structure back in.

The schema is data too

Look again at that bare entity. Its keys are just strings; nothing about it says app/port must be a number, or that an app can have more than one. It stores facts and has no opinions.

To get opinions, we need a schema. And in this model, the schema is just more data: a small set of reserved attributes under the db/ prefix. We start by giving our bare entity the identity it was missing, set like any other attribute:

db/id        app/web-server
app/name     "web-server"
app/port     8080
app/enabled  true

db/id is the handle every lookup and cross-entity reference resolves through, and the engine leans on it constantly. But to the underlying entity, db/id is just another attribute sitting next to app/name. What makes it the key is a separate definition, which is itself an ordinary entity:

db/id           db/id
db/doc          "Internal entity ID"
db/type         db/type.ref
db/cardinality  db/cardinality.one
db/uniq         db/unique.identity

Read it as a set of facts: the type is a reference, the cardinality is one, and its uniqueness is marked as an identity. That last fact is what makes db/id the key. It’s not a rule baked into the engine from thin air; it’s a plain fact sitting on a plain entity, driving those engine behaviors.

And that schema entity has a db/id of its own: db/id. It is defined in terms of itself.

Every attribute in the system is described this way. The app/port from earlier isn’t a rule hidden in our Go code; it’s a definition sitting in the store that says “this is an integer, and you can have more than one”:

db/id           app/port
db/doc          "A port the app listens on"
db/type         db/type.int
db/cardinality  db/cardinality.many

This handful of db/* attributes is our entire schema language. And because it’s built from the same raw materials, it’s self-describing: the attributes that define what “type” or “cardinality” mean are themselves ordinary entities sitting in the store.

So the schema isn’t some privileged system layer sitting underneath the data. It is just more entities, sitting in the store beside the things they describe, read back the exact same way.

Compare that to SQL, where the schema lives behind a different door (ALTER TABLE, migration scripts, a system catalog). Here, “what does app/port mean?” is the same kind of query as “what port does web-server use?” You read an entity either way.
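A rough sketch of what "you read an entity either way" can look like in Go. The `Store` type, the `Check` method, and the in-memory map are hypothetical stand-ins; only the `db/*` attribute names and values come from the examples above.

```go
package main

import "fmt"

// Entity maps an attribute name to the values recorded for it.
type Entity map[string][]any

// Store maps a db/id to its entity. Descriptors live here beside the data.
type Store map[string]Entity

// first returns the first value of attr as a string, or "" if absent.
func first(e Entity, attr string) string {
	if vs := e[attr]; len(vs) > 0 {
		if s, ok := vs[0].(string); ok {
			return s
		}
	}
	return ""
}

// Check validates proposed values for attr by reading attr's own descriptor
// entity out of the store. Only db/type.int is handled in this sketch.
func (s Store) Check(attr string, values []any) error {
	desc, ok := s[attr]
	if !ok {
		return fmt.Errorf("%s: no descriptor in store", attr)
	}
	if first(desc, "db/cardinality") == "db/cardinality.one" && len(values) > 1 {
		return fmt.Errorf("%s: cardinality is one, got %d values", attr, len(values))
	}
	if first(desc, "db/type") == "db/type.int" {
		for _, v := range values {
			if _, ok := v.(int64); !ok {
				return fmt.Errorf("%s: %v is not an int", attr, v)
			}
		}
	}
	return nil
}

func main() {
	s := Store{
		// The descriptor for app/port, stored like any other entity.
		"app/port": {
			"db/doc":         {"A port the app listens on"},
			"db/type":        {"db/type.int"},
			"db/cardinality": {"db/cardinality.many"},
		},
		// An ordinary entity in the same store.
		"app/web-server": {
			"app/port": {int64(8080), int64(443)},
		},
	}
	fmt.Println(s["app/web-server"]["app/port"]) // what port does web-server use?
	fmt.Println(s["app/port"]["db/type"])        // what does app/port mean?
	fmt.Println(s.Check("app/port", []any{"eighty"}))
}
```

Both questions are the same map lookup, and the validation logic gets its rules by performing that lookup rather than from a hardcoded table.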

The migration that wasn’t

In a traditional relational database, holding multiple values for a single field usually means choosing between a second table with a join, or dropping relational guarantees to use a JSONB blob or array column.

The entity model bypasses that trade-off. To listen on several ports, a many-cardinality attribute simply appears on the entity more than once:

...
app/port     8080
app/port     443
...

Values don’t have to be scalars, either. An attribute can hold references to other entities, arrays, labels, or nested components. The data keeps its natural shape, rather than being shredded across normalized tables or stuffed into opaque blobs.

And the payoff that sent us down this whole road falls right out of that open structure: adding a field is not a migration.

A new attribute is just an attribute that older entities don’t happen to have. There is nothing to alter, nothing to backfill, and, crucially, no migrations to smear across a fleet of customer clusters.

Real reshapes are a different story, of course. When we actually need to change what existing data means (renaming an attribute, say), there is no magic; we have to write code to do it. But we get to meet each case on its own terms. Usually, that’s a simple Go function that runs at boot and walks every entity, updating the ones that need it. A few times we’ve been lazier, converting entities on the fly the next time they are loaded. Neither approach has called for a migration framework. At our current scale, blunt approaches work just fine, allowing us to stay simple now and only build heavier tooling if a future reshape actually demands it.
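For a sense of how blunt a boot-time reshape can be, here is a sketch of renaming an attribute across every entity. The function names, the in-memory map, and the `app/host` and `app/hostname` attributes are all hypothetical.

```go
package main

import "fmt"

// Entity maps an attribute name to the values recorded for it.
type Entity map[string][]any

// renameAttr moves every value stored under from to to and reports whether
// the entity changed. Entities already converted are left alone, so it is
// safe to run on every boot.
func renameAttr(e Entity, from, to string) bool {
	vs, ok := e[from]
	if !ok {
		return false
	}
	e[to] = append(e[to], vs...)
	delete(e, from)
	return true
}

// reshapeAtBoot walks every entity and returns how many it rewrote. A real
// store would write each changed entity back; here they change in place.
func reshapeAtBoot(all map[string]Entity, from, to string) int {
	changed := 0
	for _, e := range all {
		if renameAttr(e, from, to) {
			changed++
		}
	}
	return changed
}

func main() {
	all := map[string]Entity{
		"app/web-server": {"app/host": {"web.example.com"}},
		"app/worker":     {"app/name": {"worker"}},
	}
	fmt.Println(reshapeAtBoot(all, "app/host", "app/hostname")) // first boot
	fmt.Println(reshapeAtBoot(all, "app/host", "app/hostname")) // later boots
}
```

The lazy variant would call the same `renameAttr` from the load path instead of from a boot-time walk.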

Validations are data too

Types get you a long way, but they’re coarse: saying app/port is an int doesn’t say it must be a real, valid port number. A rule like that needs a home.

Datomic lets you hang a JVM function off an attribute and run it at write time. We loved how far schema-as-data had carried us, and didn’t want to stop short of it here, so we made validation data too. The rule is a small expression in CEL, Google’s Common Expression Language, and it lives on its own little entity that the attribute’s schema points at. The check is a fact like any other:

// port must be a real port
value >= 1 && value <= 65535

// name must be a DNS-safe label
value.matches('^[a-z][a-z0-9-]*$')

Because a CEL rule is just a string, it stores and travels like any other attribute; there’s no custom code on a classpath somewhere that has to stay in sync with the schema. That keeps us honest to our own premise: the rule is data, not code pointed at by data. A bad write gets turned away at the door instead of being discovered weeks later. It also keeps our type system from sprawling: rather than first-class Port, IPAddress, or DNSLabel types, we just keep a primitive and a sentence about what makes it valid.
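To show what those two rules accept and reject, here is a small Go sketch. Evaluating real CEL needs a CEL library, so these are plain Go predicates standing in for the expressions; the function names are made up. CEL's `matches` and Go's `regexp` package both use RE2 syntax, so the pattern carries over unchanged.

```go
package main

import (
	"fmt"
	"regexp"
)

var dnsLabel = regexp.MustCompile(`^[a-z][a-z0-9-]*$`)

// validPort mirrors the CEL rule: value >= 1 && value <= 65535
func validPort(value int64) bool { return value >= 1 && value <= 65535 }

// validName mirrors the CEL rule: value.matches('^[a-z][a-z0-9-]*$')
func validName(value string) bool { return dnsLabel.MatchString(value) }

func main() {
	for _, p := range []int64{8080, 0, 70000} {
		fmt.Println(p, validPort(p))
	}
	for _, n := range []string{"web-server", "Web_Server", "9lives"} {
		fmt.Println(n, validName(n))
	}
}
```

One thing the sketch makes easy to see: the pattern as written accepts a trailing hyphen and puts no limit on length, both of which a strict DNS label would rule out. Whether that matters depends on where the name ends up being used.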

The store is just a projection

We have described this entire system so far in the abstract. That isn’t an accident: it’s one of the things the model does best. An entity is just a logical collection of facts; the physical database is just a projection of those facts.

This strict boundary keeps the core model decoupled from storage. Datomic runs this same model on top of everything from Postgres to DynamoDB to Cassandra because its storage contract is simple: it treats the backing database as a passive key-value store for compressed binary segments. It doesn’t ask the database to run queries, index attributes, or even stream changes.

But a reactive control plane does need a change stream. To get one, Datomic runs a centralized, single-writer Transactor process that pushes transaction reports directly out to its peers.

For Miren, we wanted that same clean logical decoupling, but we didn’t want to build or operate a custom distributed messaging layer. We wanted a backing store that could hand us both optimistic concurrency and a reactive change feed natively.

We landed where Kubernetes did

Kubernetes settled this same question for its own state years ago: all of it lives in etcd, a key-value store, rather than a relational database. Kubernetes earned its reputation at a scale and generality most teams will never reach, and it asks a lot of its operators to get there; we’re building for the teams who’d rather not make all of those decisions. But the storage instinct at the bottom of it was one we were glad to borrow. We landed in the same place, with one difference that matters: a Kubernetes object is a whole typed thing serialized at its key, where ours is an open bag of attributes. Same substrate, a far more flexible schema riding on top.

etcd earns its spot because it provides two primitives that are foundational to a control plane: revisions and watches. Every write gets a monotonically increasing revision, which provides optimistic concurrency: you read an entity at a revision, and write it back conditional on that revision. If a conflict occurs, the write fails loudly instead of quietly overwriting changes.
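The read-then-conditional-write pattern can be sketched with an in-memory stand-in. This is not etcd's API: `Store`, `Get`, and `PutIfRev` are hypothetical names. With etcd's Go client the equivalent is typically a transaction that compares the key's mod revision before putting the new value.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict means the key changed after the caller read it.
var ErrConflict = errors.New("revision conflict")

type item struct {
	value string
	rev   int64 // revision at which this key was last written
}

// Store is an in-memory stand-in that mimics revisioned writes.
type Store struct {
	mu   sync.Mutex
	rev  int64 // store-wide counter, bumped on every successful write
	data map[string]item
}

func NewStore() *Store { return &Store{data: map[string]item{}} }

// Get returns the value and the revision at which it was last written.
// A key that does not exist reports revision 0.
func (s *Store) Get(key string) (string, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	it := s.data[key]
	return it.value, it.rev
}

// PutIfRev writes only if the key is still at readRev, the revision the
// caller saw when it read. Otherwise it fails without changing anything.
func (s *Store) PutIfRev(key, value string, readRev int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key].rev != readRev {
		return 0, ErrConflict
	}
	s.rev++
	s.data[key] = item{value, s.rev}
	return s.rev, nil
}

func main() {
	s := NewStore()
	s.PutIfRev("app/web-server", "port=8080", 0) // create: 0 means "absent"

	// Two writers read the same revision, then both try to write.
	_, rev := s.Get("app/web-server")
	_, errA := s.PutIfRev("app/web-server", "port=8081", rev)
	_, errB := s.PutIfRev("app/web-server", "port=9090", rev)
	fmt.Println(errA, errB)
}
```

The second writer gets an error and has to re-read before retrying, which is the "fails loudly" behavior described above.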

With watches, you can subscribe to a key or a prefix to get a live stream of changes. This is why a control plane is so elegant to build on etcd: reconciliation becomes a matter of watching the desired state and responding, rather than polling and diffing the entire database. The reactive loop we wanted comes straight from the storage layer, without a separate message bus or a custom transactor network.
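The shape of a watch-driven reconciler, sketched with a Go channel standing in for the watch stream. The `Event` type, the key layout, and `reconcile` are assumptions for illustration; a real etcd watch delivers events from the server and filters by prefix there rather than in the consumer.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is one change to one key, as a watch would deliver it.
type Event struct {
	Key     string
	Value   string
	Deleted bool
}

// reconcile applies each change under prefix to the running state as it
// arrives. It never scans or diffs the full desired state.
func reconcile(events <-chan Event, prefix string, running map[string]string) {
	for ev := range events {
		if !strings.HasPrefix(ev.Key, prefix) {
			continue
		}
		if ev.Deleted {
			delete(running, ev.Key)
			continue
		}
		running[ev.Key] = ev.Value
	}
}

func main() {
	events := make(chan Event, 4)
	events <- Event{Key: "/apps/web", Value: "v1"}
	events <- Event{Key: "/apps/web", Value: "v2"}
	events <- Event{Key: "/routes/home", Value: "web"} // outside the prefix
	events <- Event{Key: "/apps/old", Deleted: true}
	close(events)

	running := map[string]string{"/apps/old": "v9"}
	reconcile(events, "/apps/", running)
	fmt.Println(running)
}
```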

Then there’s the operational side of the equation. Because etcd is lightweight and has a well-understood operational model, we can embed it directly inside Miren. It ships with the platform, and Miren manages its lifecycle automatically, so users don’t have to install a database or manage replication themselves. Running etcd in a fully highly-available configuration (surviving the loss of any node) is on our near-term roadmap rather than in the box today, but the path is well-worn: every production Kubernetes cluster already runs on HA etcd. The operational weight we wanted to avoid with Postgres is weight we can carry internally, instead of handing it to our users.

It helps that our state is small. A typical cluster’s entire control plane (every app, version, and route) amounts to a few megabytes, orders of magnitude below etcd’s default storage quota. Because the live data is bounded by configuration rather than transactional traffic, it isn’t going to outgrow that on its own. The one part that grows over time is the historical log of past versions, which we don’t trim today. It hasn’t been urgent at these sizes, but the growth is unbounded, so pruning old versions is something we’ll add before long. And if we ever do approach etcd’s storage ceilings, its operational limits are well documented.

CBOR on the wire

For an entity to live in etcd, it first has to become bytes, and we encode it with CBOR, a compact binary format that’s something like a typed cousin of JSON. The design detail that makes this work is how values are represented: every value travels as a small [type, value] pair, with the type tag riding right alongside the data. So when we read 8080 back out we know it was an int64 and not a float, and a reference to another entity is unmistakably a reference, not a string that happens to look like one. The type we declared in the schema and the type carried on the wire agree, but because the tag rides with the value, a reader can decode an entity without stopping to look up its schema first.

We don’t lean on that self-describing property much today, but it is a powerful invariant to keep in reserve. We also sort an entity’s attributes before encoding to guarantee deterministic serialization. This makes comparing or diffing raw bytes straightforward.
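To illustrate tagged values and sorted attributes, here is a hand-rolled encoder for a tiny subset of CBOR (unsigned integers, text strings, arrays). Everything about the layout is an assumption: the article does not give the real tag numbers or the exact structure, so the `[name, [type, value]]` arrangement and the `tagString` and `tagInt` constants are invented for this sketch. Signed integers, which CBOR encodes under a separate major type, are left out.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"sort"
)

// Hypothetical type tags; the article does not give the real numbering.
const (
	tagString uint64 = 1
	tagInt    uint64 = 2
)

type Attr struct {
	Name  string
	Type  uint64
	Value any // string or uint64 in this sketch
}

// head appends a CBOR item head: major type plus the shortest-form argument.
func head(buf []byte, major byte, n uint64) []byte {
	if n < 24 {
		return append(buf, major<<5|byte(n))
	}
	info, width := byte(24), 1
	switch {
	case n >= 1<<32:
		info, width = 27, 8
	case n >= 1<<16:
		info, width = 26, 4
	case n >= 1<<8:
		info, width = 25, 2
	}
	buf = append(buf, major<<5|info)
	for i := width - 1; i >= 0; i-- {
		buf = append(buf, byte(n>>(8*uint(i))))
	}
	return buf
}

// text appends a CBOR text string (major type 3).
func text(buf []byte, s string) []byte {
	return append(head(buf, 3, uint64(len(s))), s...)
}

// Encode sorts attributes by name, then writes each as [name, [type, value]].
// Repeated names keep their input order; a fuller version would also order
// by value.
func Encode(attrs []Attr) []byte {
	sorted := append([]Attr(nil), attrs...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	buf := head(nil, 4, uint64(len(sorted))) // outer array of attributes
	for _, a := range sorted {
		buf = head(buf, 4, 2) // [name, pair]
		buf = text(buf, a.Name)
		buf = head(buf, 4, 2) // [type, value]
		buf = head(buf, 0, a.Type)
		switch v := a.Value.(type) {
		case string:
			buf = text(buf, v)
		case uint64:
			buf = head(buf, 0, v)
		default:
			panic(fmt.Sprintf("unsupported value type %T", v))
		}
	}
	return buf
}

func main() {
	a := []Attr{
		{"app/port", tagInt, uint64(8080)},
		{"app/name", tagString, "web-server"},
	}
	b := []Attr{a[1], a[0]} // same facts, different order
	fmt.Println(hex.EncodeToString(Encode(a)))
	fmt.Println(hex.EncodeToString(Encode(a)) == hex.EncodeToString(Encode(b)))
}
```

Because the attributes are sorted before encoding, two entities holding the same facts produce identical bytes regardless of the order they were built in, which is what makes byte-level comparison meaningful.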

Secondary indexes, by being present

etcd will hand back an entity the moment you know its id, but a control plane constantly needs the other direction: every app in a project, every route pointing at a service, or every sandbox belonging to a user. That’s a secondary index, and a bare key-value store won’t give you one.

Datomic solves this by running a heavy background indexing engine on the Transactor, which periodically merges new datoms into compressed index trees stored as segments. For a lightweight control plane, we wanted to avoid that asynchronous indexing lag and complexity. Because etcd supports atomic multi-key transactions, we can write our indexes synchronously as part of the entity write itself, using the only material etcd offers: more keys.

When an attribute’s descriptor sets db/index, every write to an entity lays down an extra key beside it, under a collection prefix:

<prefix>/collections/<hash>/<entity-id>

The <hash> is derived from the attribute and its value together (the attribute ID, its type, and its value bytes), so every entity sharing that exact attribute-value lands under the same prefix. To answer “which apps belong to project X?”, you hash that attribute-value pair and list the keys beneath it; each key’s tail is an entity id.

The presence of the key is the index. (We tuck the entity id into the value too, just to save a reader from parsing it back out of the path, but it’s the key’s existence that records the fact.)
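A sketch of how such keys could be built and listed. The hash function is not named in the article, so truncated SHA-256 here is an assumption, as are the `/miren` prefix, the `app/project` attribute name, and the function names. A sorted slice of strings stands in for etcd's keyspace.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// collectionHash derives one hash from the attribute ID, its type, and the
// value bytes. Truncated SHA-256 is an assumption made for this sketch.
func collectionHash(attrID, typ string, value []byte) string {
	h := sha256.New()
	for _, part := range [][]byte{[]byte(attrID), []byte(typ), value} {
		// Length-prefix each part so ("ab","c") and ("a","bc") differ.
		fmt.Fprintf(h, "%d:", len(part))
		h.Write(part)
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

// indexKey builds <prefix>/collections/<hash>/<entity-id>.
func indexKey(prefix, attrID, typ string, value []byte, entityID string) string {
	return prefix + "/collections/" + collectionHash(attrID, typ, value) + "/" + entityID
}

// list returns the entity ids whose index key sits under the collection for
// this attribute-value pair: a prefix scan, nothing more.
func list(keys []string, prefix, attrID, typ string, value []byte) []string {
	p := prefix + "/collections/" + collectionHash(attrID, typ, value) + "/"
	var ids []string
	for _, k := range keys {
		if strings.HasPrefix(k, p) {
			ids = append(ids, strings.TrimPrefix(k, p))
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	x, y := []byte("project/x"), []byte("project/y")
	keys := []string{
		indexKey("/miren", "app/project", "ref", x, "app/web"),
		indexKey("/miren", "app/project", "ref", y, "app/api"),
		indexKey("/miren", "app/project", "ref", x, "app/worker"),
	}
	fmt.Println(list(keys, "/miren", "app/project", "ref", x))
}
```

Answering "which apps belong to project X?" never reads an entity body; the ids come straight out of the key paths.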

This synchronous approach gives us two properties that are notoriously hard to get in traditional distributed indexing. First, because the index entries are written in the same transaction as the entity itself, they are guaranteed to be atomic: the index and the entity can never drift. Second, because etcd watches work on prefixes, you can watch an entire index collection. Subscribing to “every app in project X” gives you a live feed of matching entities as they are created, updated, or deleted. This allows our reactive control plane to extend all the way into our queries.

We only build equality indexes this way, and we initially wondered if we’d miss range queries (the “every entity with app/port under 1024” sort of question). In practice, we haven’t: fetching a small set by equality and filtering in memory has been more than fast enough at our scale. If we ever need a more sophisticated index shape, it is just a different arrangement of keys.
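The fetch-then-filter approach is about as simple as it sounds. In this sketch the `App` type and `withPortBelow` are hypothetical, and the input slice stands in for whatever an equality index lookup returned.

```go
package main

import "fmt"

type App struct {
	ID    string
	Ports []int64
}

// withPortBelow keeps the apps that have at least one port under limit.
// The input is assumed to be the small set already fetched by equality.
func withPortBelow(apps []App, limit int64) []App {
	var out []App
	for _, a := range apps {
		for _, p := range a.Ports {
			if p < limit {
				out = append(out, a)
				break
			}
		}
	}
	return out
}

func main() {
	inProject := []App{
		{ID: "app/web", Ports: []int64{8080, 443}},
		{ID: "app/worker", Ports: []int64{9000}},
	}
	for _, a := range withPortBelow(inProject, 1024) {
		fmt.Println(a.ID)
	}
}
```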

YAML schemas, generated Go

Up to now, a schema has been data in the store: app/port is a descriptor entity, a bag of db/* attributes like any other. Flexible, but not something you’d want to write or read by hand. Building those descriptor bags yourself, or pulling attributes out of an entity one string key at a time in application code, is loose and error-prone, the sort of thing static types are supposed to rule out.

So we author a level up. We write a schema in YAML and let a generator produce the rest. A schema defines one or more kinds, like app or route, each a named bundle of attributes:

domain: dev.miren.core
version: v1alpha
kinds:
  app:
    name:
      type: string
    port:
      type: int
      many: true
    project:
      type: ref
      indexed: true

We run this file through our schemagen tool via go:generate to produce type-safe Go code. For each kind, we get a struct, Encode and Decode methods to map between the struct and the underlying entity, and the registration logic that teaches the store how to validate incoming writes:

type App struct {
    Name    string
    Port    []int64
    Project entity.Id
}

func (a *App) Decode(e entity.AttrGetter) { /* ... */ }
func (a *App) Encode() []entity.Attr      { /* ... */ }

This is where our application logic actually lives. Day-to-day, we work with ordinary Go structs, letting the compiler catch type errors, while underneath, the data remains a flexible collection of (attribute, value) facts. Adding a field is simple: add a line of YAML, run go generate, and use it. There are no hand-written serialization routines to maintain, and no risk of the database representation drifting from the application code because both are derived from the same source of truth.
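For a feel of what the generated methods might do, here is a hand-written approximation. The real output of schemagen is not shown in this post, so the local `Attr`, `Attrs`, and `AttrGetter` types (stand-ins for the `entity` package), the `GetAll` method, and the attribute IDs are all assumptions.

```go
package main

import "fmt"

// Id, Attr, and AttrGetter are local stand-ins for the entity package.
type Id string

type Attr struct {
	ID    string
	Value any
}

type AttrGetter interface {
	GetAll(id string) []any
}

// Attrs is a flat list of attributes that can be read back by ID.
type Attrs []Attr

func (as Attrs) GetAll(id string) []any {
	var out []any
	for _, a := range as {
		if a.ID == id {
			out = append(out, a.Value)
		}
	}
	return out
}

type App struct {
	Name    string
	Port    []int64
	Project Id
}

// Encode flattens the struct into attributes. Zero-valued fields are left
// out, and a many-cardinality field emits one attribute per value.
func (a *App) Encode() []Attr {
	var attrs []Attr
	if a.Name != "" {
		attrs = append(attrs, Attr{"app/name", a.Name})
	}
	for _, p := range a.Port {
		attrs = append(attrs, Attr{"app/port", p})
	}
	if a.Project != "" {
		attrs = append(attrs, Attr{"app/project", a.Project})
	}
	return attrs
}

// Decode fills the struct from whatever attributes the entity has.
// Attributes belonging to other schemas are simply never asked for.
func (a *App) Decode(e AttrGetter) {
	for _, v := range e.GetAll("app/name") {
		if s, ok := v.(string); ok {
			a.Name = s
		}
	}
	a.Port = nil
	for _, v := range e.GetAll("app/port") {
		if p, ok := v.(int64); ok {
			a.Port = append(a.Port, p)
		}
	}
	for _, v := range e.GetAll("app/project") {
		if id, ok := v.(Id); ok {
			a.Project = id
		}
	}
}

func main() {
	in := App{Name: "web-server", Port: []int64{8080, 443}, Project: "project/x"}
	var out App
	out.Decode(Attrs(in.Encode()))
	fmt.Printf("%+v\n", out)
}
```

An entity written before a field existed decodes cleanly: the missing attribute leaves the struct field at its zero value.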

One entity, many schemas

Because every attribute is namespaced, nothing requires all of an entity’s attributes to come from the same schema. A single entity can carry attributes from multiple schemas simultaneously, answering to each of them.

We use this pattern constantly for cross-cutting concerns. For example, we have a shared metadata schema that tracks owning projects and labels. This metadata rides along on nearly every entity, alongside its kind-specific schema. An app entity is both a metadata entity and an app entity. A sandbox is both metadata and a sandbox.

It’s the same primitive doing double duty: just as multiple app/port attributes represent multiple ports, a multi-valued entity/kind attribute lets an entity declare multiple kinds. Only now the repeated attribute is composing entirely different types, and the entity can honestly answer to all of them at once.

Composition is as simple as concatenation. Because each generated struct knows how to encode itself into raw attributes, we can instantiate an entity by combining multiple encoders:

store.CreateEntity(ctx, entity.New(
    (&Metadata{Project: projID}).Encode,
    (&App{Name: "web-server", Port: []int64{8080}}).Encode,
))

The metadata/* and app/* attributes sit side-by-side in a single flat entity. They can never collide because their namespaces are distinct.
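Here is a self-contained sketch of why composition reduces to concatenation. The `New` function, the `Attr` type, the attribute IDs, and the `entity/kind` values are hypothetical approximations of the call shown above, not the real `entity` package.

```go
package main

import "fmt"

type Attr struct {
	ID    string
	Value any
}

type Metadata struct{ Project string }

func (m *Metadata) Encode() []Attr {
	return []Attr{
		{"entity/kind", "metadata"},
		{"metadata/project", m.Project},
	}
}

type App struct {
	Name string
	Port []int64
}

func (a *App) Encode() []Attr {
	attrs := []Attr{{"entity/kind", "app"}, {"app/name", a.Name}}
	for _, p := range a.Port {
		attrs = append(attrs, Attr{"app/port", p})
	}
	return attrs
}

// New concatenates the output of each encoder into one flat attribute list.
// It needs no knowledge of which schemas are being combined.
func New(encoders ...func() []Attr) []Attr {
	var out []Attr
	for _, enc := range encoders {
		out = append(out, enc()...)
	}
	return out
}

func main() {
	e := New(
		(&Metadata{Project: "project/x"}).Encode,
		(&App{Name: "web-server", Port: []int64{8080}}).Encode,
	)
	for _, a := range e {
		fmt.Println(a.ID, a.Value)
	}
}
```

The result carries two `entity/kind` values, one per schema, which is the multi-valued attribute doing the work of declaring both kinds.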

If you have worked with Kubernetes, this design will feel familiar. Every Kubernetes object carries a metadata.name and metadata.labels, regardless of its kind. But in Kubernetes, those fields are hardcoded into every type definition; here, metadata is just another schema you opt into. When we want to introduce a new cross-cutting concern (ownership, soft-delete tombstones, audit trails), we simply write a small schema and attach it. None of our existing kinds or data storage layouts have to change.

Where this leaves us

Building this system took a lot of work: designing synchronous indexing on top of etcd, tuning prefix watches, and chasing down concurrency bugs. When you take a swing this big, there’s no guarantee the effort is going to be worth it.

So it’s really gratifying to say that, so far, it has been. The system mostly stays out of our way now, the kind of thing we don’t think about day-to-day, and it’s clean to extend as we grow. As we push toward a fully highly-available, “lose-any-node” cluster architecture, we’ll keep pressure-testing the bets we’ve made, learning where the flex points we anticipated were right and where we’ll need to adjust. The foundation under all of it is settled, though: datoms, CBOR, and a key-value store, all the way down. Turns out that’s a pretty good place to build a control plane.