Orchestration isn’t going anywhere
Whether it resides in many silos or a single plane, orchestration is inescapable
This symposium asks the wrong question
When Stephen asked me to participate in this symposium, my first reaction was “Are you sure?” After all, I am the founder of a company whose sole purpose is to build a system that many would call an “orchestrator.” I'm not exactly impartial.
The question assumes there is an agreed-upon definition for the orchestrator. If that definition is “the system whose sole responsibility is scheduling and ordering of tasks in production and nothing else,” then my answer is: it’s alive, but no one likes it, and it probably deserves to die.
If that’s my answer, then why am I doing what I am doing with my life? Worry not, this symposium has not caused some sort of existential crisis. Rather, I think “Is the orchestrator dead or alive?” asks the wrong question. We ought to ask about orchestration, not the orchestrator.
Orchestration is an essential capability and isn’t going anywhere. The right question is: what is the future of orchestration?
Orchestration versus orchestrator
Modern organizations build data products and assets to power analytics, ML, and their production applications. To do this, data practitioners use a variety of technologies to create data pipelines. From the first pipeline to the full data platform, you need orchestration.
It’s worth defining orchestration:
Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process. — Databricks
Data will, for the foreseeable future, be stored and computed in many storage systems and runtimes. Not all of the data within an organization is going to live in a single cloud data warehouse, executing on a single compute substrate. Organizational dynamics, economic realities, and technical constraints will not allow it.
But that definition of orchestration is too narrow in the data domain. Even with a single system, there is a dependency graph of data assets that constitute a data pipeline, as all data must come from somewhere and go somewhere. Computations to produce that data must be ordered and scheduled.
If you are a data practitioner, you are orchestrating whether you call it that or not. If you are writing a data notebook that processes a file dropped in S3 and manually running it once a day, you’re orchestrating. If you are using dbt on a single warehouse, you are orchestrating: somewhere in the bowels of that codebase, there is a topological sort determining the order of model execution and launching compute into the data warehouse. If you have set up cron jobs in Fivetran, Snowflake Tasks, and Hightouch to flow data through your platform, you are orchestrating.
Overlapping cron jobs across modern data stack SaaS tools has become a popular way to avoid orchestration tools.
However, it’s just distributed orchestration, and that leads to operational silos. Good luck debugging an upstream data pipeline that breaks the ML team in a totally separate stack. It results in an operationally fragile data platform that leaves everyone in a constant state of confusion about what ran, what's supposed to run, and whether things ran in the right order.
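To make the topological sort mentioned above concrete, here is a minimal sketch of ordering models by their dependencies using Python's standard-library `graphlib`. The model names and the `run_all` helper are hypothetical and chosen only for illustration; this is not how dbt or any particular tool implements it.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
MODEL_DEPENDENCIES = {
    "raw_orders": set(),
    "raw_customers": set(),
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "customer_orders": {"stg_orders", "stg_customers"},
}


def execution_order(dependencies):
    """Return a run order in which every model comes after its upstream models.

    graphlib raises CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_all(dependencies, run_model):
    """Call run_model(name) once per model, upstream models first."""
    for name in execution_order(dependencies):
        run_model(name)


if __name__ == "__main__":
    run_all(MODEL_DEPENDENCIES, lambda name: print(f"running {name}"))
```

Whether this ordering lives inside one tool or is implied by staggered cron schedules across several, something has to compute it. The difference is whether anyone can see and debug it in one place.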
Control plane, not an orchestrator
We at Dagster conceptualize ourselves as a data management tool. All production data assets in an organization are represented, in software, within Dagster. As they are software artifacts, change management is done with software engineering processes. Keeping those data assets up to date is an essential responsibility of Dagster, and so orchestration is a core capability.
This is a team- and tool-spanning layer, not a silo. Keeping a graph of assets up to date for stakeholders is the core function of any data, analytics, or ML engineering team. And those teams are stakeholders with respect to each other. Sharing a control plane, while bringing their own transformation and domain-specific tooling, is proper and natural.
This control plane also has an active metadata layer. Dagster streams information about the assets into an immutable, structured event log. Users can plug in their own metadata as well. This serves as a ledger for the data platform usable for many purposes: versioning, quality, and others. Directly within the system, users can schedule based on activity in this ledger. This can take the form of policy or explicit scheduling.
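As a rough illustration of scheduling from a ledger, the sketch below keeps an append-only log of materialization events and applies a simple policy: an asset needs a refresh if it has never been materialized or if any upstream asset has been materialized more recently. Every name here (`MaterializationEvent`, `EventLog`, `is_stale`) is hypothetical, and none of this is Dagster's actual API; it only shows the shape of the idea.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class MaterializationEvent:
    """One ledger entry; frozen so its fields cannot be reassigned."""
    sequence: int                 # position in the log, used as a logical clock
    asset: str
    metadata: Mapping[str, str]   # user-supplied metadata, exposed read-only


class EventLog:
    """Append-only log: events can be recorded and read, never edited."""

    def __init__(self) -> None:
        self._events: List[MaterializationEvent] = []

    def record(self, asset: str, metadata: Optional[Mapping[str, str]] = None) -> MaterializationEvent:
        event = MaterializationEvent(
            sequence=len(self._events),
            asset=asset,
            metadata=MappingProxyType(dict(metadata or {})),
        )
        self._events.append(event)
        return event

    def latest(self, asset: str) -> Optional[MaterializationEvent]:
        for event in reversed(self._events):
            if event.asset == asset:
                return event
        return None


def is_stale(log: EventLog, asset: str, upstream: List[str]) -> bool:
    """Policy: refresh if never materialized or if any upstream is newer."""
    mine = log.latest(asset)
    if mine is None:
        return True
    for name in upstream:
        theirs = log.latest(name)
        if theirs is not None and theirs.sequence > mine.sequence:
            return True
    return False
```

A scheduler built on this would evaluate `is_stale` for each asset and launch the computation for those that return `True`. Explicit scheduling would instead trigger runs at fixed times and simply record the results in the same log.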
A system of record of production assets, combined with metadata, naturally should and will incorporate lineage, data quality, cataloging, observability, and governance. With well-defined APIs, this will integrate an entire ecosystem of tools that provide specialized, higher-level functionality in all of those domains.
Here comes the iPhone/iOS analogy
What I am describing here is a rebundling dynamic. Try as I might, I cannot help but reach for the analogy to the iPhone – it is too apt.
Sending emails on the go, digital photography, contact management, texting, and voice communication are essential capabilities. That does not mean that BlackBerrys, mass-market digital cameras, PDAs, and flip phones deserved to survive as standalone devices.
Consolidating those capabilities into a single device was a watershed moment in personal computing, on the order of the broad adoption of the original PC.
And it wasn’t just the iPhone, it was iOS. All of those capabilities needed organization, coherence, and rules, or else the user would live in an untrusted, chaotic world. iOS did that. The grid of applications on your home screen is the manifestation of that ordered heterogeneity. The user can organize and catalog myriad capabilities within a single, trusted, coherent experience.
In our vision, the asset graph is data’s in-product manifestation of that ordered heterogeneity. Assets computed by any runtime and stored in any system conform to a common protocol. The graph they reside in is not a post hoc observation of your assets, but a system of record that is alive.
So just like the iPhone is still a “phone”, we might still end up calling the new type of orchestrator an orchestrator. But it will in no way resemble the orchestrators of the past. And any system that claims to be an orchestrator without these capabilities will be viewed as woefully deficient.
While the standalone orchestrator might be going out of fashion, orchestration is alive and well.
The “orchestrator” will fade in the same way that BlackBerrys, digital cameras, Palm Pilots, and flip phones did. But capabilities and tools are very different things. The capability that is orchestration is an essential, undeniable one. It must and will live on. The only question is where. And the answer to that question will be one of the most consequential in data infrastructure over the next decade.