Data Materialization is a Convergence Problem
We've spent years shoving a square peg into a round hole
This is essay #7 (of 10) in the Symposium on Is the Orchestrator Dead or Alive? You can read more from Alex on his professional website, Bits On Disk.
If you’ve already got a workflow orchestrator, it’s very tempting to treat data materialization – the act of turning abstract data models, typically defined in SQL, into relational tables in your data warehouse / lake / lakehouse / swamp / outhouse / etc. – as an orchestration problem. After all, the problem sounds like a straightforward set of related, imperative tasks: make a job DAG where every task represents a model and edges represent data dependencies between models, long-poll data sources until they become “fresh”, freshen models in data-dependent order, repeat frequently enough to meet your recency SLOs.
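For concreteness, here's a minimal sketch of that task-oriented approach in plain Python — not any particular orchestrator's API — with placeholder model names and stubbed-out freshness and materialization calls:

```python
# Minimal sketch of the orchestration-style approach: one task per model,
# edges for data dependencies, a freshness poll, and a run in dependency order.
# Model names and the freshness/materialization helpers are placeholders.
import time
from graphlib import TopologicalSorter

# Each model maps to the set of upstream models it depends on.
MODEL_DAG = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "agg_daily_revenue": {"fct_orders"},
}

def source_is_fresh(model: str) -> bool:
    """Placeholder: poll the model's upstream source for recency."""
    return True

def materialize(model: str) -> None:
    """Placeholder: run the model's SQL against the warehouse."""
    print(f"MATERIALIZE {model}")

def run_once() -> None:
    # Long-poll sources, then refresh every model in data-dependent order.
    for model in TopologicalSorter(MODEL_DAG).static_order():
        while not source_is_fresh(model):
            time.sleep(60)
        materialize(model)

if __name__ == "__main__":
    run_once()  # a scheduler would repeat this often enough to meet recency SLOs
```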
Easy, right?
Unfortunately, models are rarely that well-behaved. Models change all the time, data dependencies between them can be complex, and SLOs change as business demands shift. Backfills, data deletion requests, and run-of-the-mill bugs in query logic are facts of life. The more complex and dynamic your data becomes, the more shoehorning is required to fit all that dynamism into the workflow orchestrator’s imperative, scheduled, task-oriented abstraction. All too often, the result is a giant shambling mess of interdependent job DAGs that are tedious to operate, difficult to reason about, and dangerous to change.
I’ve become increasingly convinced that we’ve been trying to shove a square peg into a round hole by treating data materialization as a workflow orchestration problem. Data materialization isn’t an orchestration problem, it’s a convergence problem, and we need a new system to handle it.
Why Convergence?
I’m far from the first person to have longed for a convergence-based solution to data materialization. Benn Stancil wrote a post on his blog in August of last year called “Down with the DAG”, where he laments the current state of workflow orchestration in the (ugh) Modern Data Stack. The entire post is worth a read, but this excerpt sums up his main point:
I don’t actually want to think about when to run jobs, how to define DAGs, or to manually orchestrate anything. I just want my data to be fresh—where I can declare what fresh means—and to know when it’s not.
Earlier in this symposium, Vinnie echoed the same sentiment:
Let’s all just stop writing ETL altogether. Let’s declare the outputs of our pipelines and let our systems figure out the rest. If humans don’t wanna deal with it, let the machines do it, at least while we can still command them.
In a perfect world, we’d tell the system what models we have, when we need them, and maybe what queries we’re running on them, and let some kind of data auto-materialization system do the grunt work of turning those models into tables in our warehouse. Instead, we’re left telling a workflow orchestrator both what we want and how to get it, and doing a lot of undifferentiated toil when either of those things changes.
In the abstract, the job of this hypothetical data auto-materialization system is to continually compare the state of the warehouse against the desired state of the models and their SLOs, and to modify the warehouse until the models are materialized and their SLOs are satisfied. We want the reality of the data in our warehouse to continuously converge toward our model definitions as those definitions change.
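The core of that system is a reconciliation loop. The sketch below is only meant to illustrate the shape of the loop; the helpers for reading model definitions and warehouse state are hypothetical stand-ins, not anyone's real API:

```python
# A rough sketch of the convergence loop: diff desired state (model definitions
# plus SLOs) against observed warehouse state and act only on the gap.
# The two helper functions are hypothetical stand-ins for metadata/warehouse access.
import time

def desired_state() -> dict:
    """Hypothetical: model definitions and SLOs, e.g. read from version control."""
    return {"fct_orders": {"sql": "SELECT ...", "max_staleness_s": 3600}}

def observed_state() -> dict:
    """Hypothetical: what actually exists in the warehouse, and how stale it is."""
    return {"fct_orders": {"exists": True, "staleness_s": 7200}}

def converge_once() -> None:
    want, have = desired_state(), observed_state()
    for model, spec in want.items():
        actual = have.get(model)
        if actual is None or not actual["exists"]:
            print(f"materialize {model} from scratch")
        elif actual["staleness_s"] > spec["max_staleness_s"]:
            print(f"refresh {model} to meet its SLO")
        # else: already converged, nothing to do

if __name__ == "__main__":
    while True:
        converge_once()
        time.sleep(300)  # re-evaluate continuously, not on a hand-written schedule
```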
The software world is littered with solutions to convergence problems. Infrastructure-as-Code tools like Terraform and CloudFormation are convergence-based tools that read a representation of the desired end state, compare that desired state to the current state of the infrastructure, and construct a plan to converge the current state with the desired one. Kubernetes is also a convergence-based tool: its controller-manager compares the set of declarative resource specifications with the current state of the cluster and makes changes to the cluster (adding and removing cluster nodes, starting and stopping containers, etc.) until all the desired resources are running.
In a convergence-based view of data materialization, a “materialization controller” could periodically evaluate the state of all models, inspect their materialized counterparts, and issue queries to the warehouse to partially or completely (re-)materialize them as necessary. If a few of a model’s partitions need to be recomputed, the user could simply drop the impacted partitions and allow the controller to notice their absence and re-populate them. The controller could choose how to materialize a model by comparing the costs of each materialization method and adjusting the materialization method in the background as workloads or model sizes change. It could also handle prioritizing materialization queries based on the warehouse’s current load and the models’ SLOs, automatically shifting costly materializations that can tolerate some delay to off-hours. All the incidental complexity that was once the domain of human operators disappears, leaving data teams to focus more on the data and less on the nitty-gritty technical minutiae of getting it there.
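To make a couple of those behaviours concrete, here's a rough sketch of partition-gap detection and SLO/cost-aware prioritization. The cost estimates, field names, and partition scheme are assumptions for illustration, not a real controller's interface:

```python
# Hypothetical sketch of two controller behaviours: noticing dropped partitions
# and re-populating them, and ranking pending work so SLO-urgent refreshes run
# before costly, deferrable ones.
from dataclasses import dataclass

@dataclass
class Action:
    model: str
    partitions: list[str]
    est_cost: float          # assumed cost estimate (e.g. bytes scanned)
    slo_headroom_s: float    # seconds until the model's SLO would be violated

def missing_partitions(desired: set[str], existing: set[str]) -> list[str]:
    """A dropped partition simply shows up as absent; the controller repopulates it."""
    return sorted(desired - existing)

def plan(actions: list[Action]) -> list[Action]:
    """Run SLO-critical work first; push expensive, deferrable work to the back."""
    return sorted(actions, key=lambda a: (a.slo_headroom_s, a.est_cost))

# Example: a user dropped two bad partitions of fct_orders; the controller notices.
todo = missing_partitions({"2024-01-01", "2024-01-02", "2024-01-03"},
                          {"2024-01-03"})
queue = plan([Action("fct_orders", todo, est_cost=12.0, slo_headroom_s=600),
              Action("agg_daily_revenue", ["2024-01"], est_cost=900.0, slo_headroom_s=86_400)])
for a in queue:
    print(f"re-materialize {a.model} partitions {a.partitions}")
```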
Applying convergence to data materialization appears to be an idea whose time has come. Dagster is already taking a crack at this idea with their notion of software-defined assets. What I’m calling “convergence” here, they’re calling “reconciliation”, but we’re both essentially hitting the same high points. I haven’t played with Dagster’s implementation enough to form an opinion on it yet, but what I’ve seen so far looks promising (though defining assets as Python functions makes the declarative purist in me a little itchy).
If the benefits of this approach are so apparent and work is already underway, you might wonder why we’re not already swimming in competing materialization controller implementations. I think it’s because building one is going to be really hard.
The Trouble with Convergence
The industry has enough collective production experience and battle scars with convergence-based systems to know that they’re not as easy to build or operate as they might first appear. In particular, we need to be wary of three big problems with building and operating convergence-based systems: explainability, over-eagerness, and drift.
The first big problem with convergence-based systems centers around explainability. Everything looks like magic in a convergence-based system when things are going well, but if something goes wrong it can be difficult to understand why. If you’ve ever gotten Kubernetes stuck trying to launch a pod, you’ll know what I’m talking about. Detailed machine- and human-readable logs would go a long way toward making the system easier to troubleshoot and could also serve as a rich piece of lineage information that could be used elsewhere. Those logs would need to be built into the system as a first-order architectural concern to be truly useful, though. Bolting logging onto the side once the system is already built won’t cut it.
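As a rough illustration, the sketch below has the controller emit a structured record for every decision it makes, so the same record can double as lineage. The field names are illustrative only, not a proposed schema:

```python
# Sketch of "explainability as a first-order concern": every action the
# controller takes carries a structured record of what it did, why, and which
# inputs it read. Machine-readable for tooling, renderable for humans.
import json
from datetime import datetime, timezone

def log_decision(model: str, action: str, reason: str, inputs: list[str]) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "action": action,        # e.g. "refresh_partition", "full_rebuild", "skip"
        "reason": reason,        # the rule or SLO that triggered the action
        "inputs": inputs,        # upstream tables read: reusable as lineage
    }
    print(json.dumps(record))

log_decision(
    model="fct_orders",
    action="refresh_partition",
    reason="partition 2024-01-02 missing; staleness 7200s > SLO 3600s",
    inputs=["stg_orders", "stg_customers"],
)
```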
The second big problem with convergence-based systems is that it’s easy for an over-eager controller to spend a lot of money and/or resources either doing too much or doing it too fast. The controller will need some guardrails to prevent it from trying to rebuild the universe from scratch in reaction to a misconfiguration or a hastily done refactor. These guardrails aren’t just necessary for emergencies or controller meltdowns, however. Even in normal operation, operators will need to be able to tell the controller to either stop or slow down sometimes, if only to keep the system from live-locking itself when a lot of models change at the same time.
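A minimal sketch of such guardrails, with hypothetical knobs, might look like this: hard limits the controller consults before acting, plus an explicit pause switch for operators.

```python
# Hypothetical operator-facing guardrails: the controller checks these limits
# before starting any new materialization work.
from dataclasses import dataclass

@dataclass
class Guardrails:
    paused: bool = False                 # operator "stop everything" switch
    max_concurrent_rebuilds: int = 4     # throttle: don't rebuild the universe at once
    max_daily_cost: float = 500.0        # e.g. dollars or slot-hours per day

def may_run(g: Guardrails, running: int, spent_today: float, est_cost: float) -> bool:
    if g.paused:
        return False
    if running >= g.max_concurrent_rebuilds:
        return False
    return spent_today + est_cost <= g.max_daily_cost

print(may_run(Guardrails(), running=2, spent_today=480.0, est_cost=50.0))  # False: over budget
```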
The third and arguably thorniest problem with convergence-based systems is drift. If the controller were the only thing changing the warehouse, it could keep a cache of the warehouse’s state and use that cache to decide what actions to take. With that local cache as its source of truth, it could iterate faster and lighten the query load on the warehouse itself. Unfortunately, the controller is rarely the only thing manipulating the warehouse, and its state cache can get stale very quickly. Even if you could prevent anything but the controller from modifying the warehouse, you probably wouldn’t want to; you need some way to break the proverbial glass and bypass the controller in an emergency.
There are a few ways you might be able to keep the controller’s cache warm in the presence of drift. The controller could hook into some kind of notification system so that it’s aware of changes relatively quickly. Data warehouses are good at exposing what’s being done to them via query logs, and we can attach post-commit hooks to version control systems, so both systems are ready sources of notifications. Even with that notification system in place, there will still be some latency between a change happening and the controller becoming aware of that change. To account for that latency, you’ll need to assume some amount of staleness and limit the controller to operations that are “safe” if the controller’s view of the world is out-of-date. Restricting the controller to either idempotent or reversible actions seems like a good way of doing this, but that feels like it’ll be one of the trickier parts to get right.
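One way to encode that restriction is to have each planned action declare whether it is idempotent or reversible, and force anything that is neither to wait for a fresh read of the warehouse. The action types and cache-age threshold below are assumptions, not a real API:

```python
# Sketch of "only act in ways that are safe under a stale view of the warehouse".
from dataclasses import dataclass

@dataclass
class CachedView:
    last_refreshed_s: float   # age of the controller's cached warehouse state

@dataclass
class PlannedAction:
    description: str
    idempotent: bool          # e.g. rebuild a partition from its inputs
    reversible: bool          # e.g. swap a view back to the previous table

MAX_CACHE_AGE_S = 120.0       # assumed staleness budget

def safe_to_run(action: PlannedAction, view: CachedView) -> bool:
    if action.idempotent or action.reversible:
        return True                                    # a stale cache can't make this destructive
    return view.last_refreshed_s <= MAX_CACHE_AGE_S    # otherwise demand a recent view

print(safe_to_run(PlannedAction("rebuild partition 2024-01-02", True, False),
                  CachedView(last_refreshed_s=900.0)))   # True: idempotent, safe anyway
print(safe_to_run(PlannedAction("drop deprecated table", False, False),
                  CachedView(last_refreshed_s=900.0)))   # False: re-read the warehouse first
```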
I, For One, Welcome our Convergent Overlords
While there are a lot of challenges facing the implementers of a materialization controller, I’m cautiously optimistic that it can be done. We’re in such an early stage that we can and should be bringing all the lessons that we’ve learned from prior convergence-based systems into the development of this one. I’m hopeful that we’re gaining clarity on not just what’s breaking with workflow orchestrators, but why it’s breaking, and that we’ll use that clarity and a healthy dose of past experience from adjacent parts of software engineering to build a new generation of even better tools.
All hail declarative data engineering.