In 2022, data engineers manage forests, not trees
I wrote an entire blog post trying to pin down why I dislike Airflow. But despite my rationalizations, it came out like a break-up letter — just way too personal:
I tried to make it work, I really did. But you are too old, your abstractions are clunky, and I think you’re ugly. It’s over between us.
Which would have been fine, except I knew exactly how Airflow would respond:
Steve — Sorry for taking so long to get back to you, been getting installed like 10,000 times a day. No problem re: feelings, LMK if you change your mind. -A
The reality is, Airflow is an achievement: an open source project that has penetrated the data psyche to an unsettling degree. It does what it says it does, which is more than most tools can claim. Teams use it at scale and in 2022. I mean, there is this >> [offensive, obscene] >> python >> syntax… but any project should be lucky to be so disliked, so boring, so inevitable.
So why can’t I stomach it?
Then I realized: My problem is not with Airflow. It’s with Airflow’s problem.
Who orchestrates the orchestrator?
Here is one of the first sentences in Airflow’s README file:
Airflow works best with workflows that are mostly static and slowly changing.
This was a great design in 2015 when Airflow was open-sourced. It’s still good enough for most teams in 2022 — slow, chunky, and centralized DAGs express most existing data value, even if they are unbundled into other SaaS tools.
My perspective, though, is that most future value will arise out of enabling teams to build data workflows that are faster, simpler, and more decentralized.
In fact, in the year after Airflow was open sourced, Jeff Magnusson of StitchFix wrote “Data Engineers shouldn’t write ETL” and called for a radical departure from centralized ownership of data pipelines: “[Our engineers] are not optimizing the organization for efficiency, we are optimizing for autonomy.”
Now, leftward pressure is everywhere: business users must learn analysis, analysts must practice engineering, and engineers must architect platforms.
Yet Airflow was never intended to be a heterogeneous platform for decentralized DAGs. It is a job scheduling and processing engine: it takes a single team’s workload and orchestrates it on a schedule, akin to a subway system.
The job of today’s data engineers is more akin to managing the entire transportation network — subways, sure, but also streets, buses, bike lanes. When the growth team drops 1000 scooters on the streets overnight — jerks!! — data engineers have to ensure they don’t cause accidents or get people killed. That is the new job.
What’s confusing is that the shift has been subtle and inconsistent across companies. Responsibilities resemble those of the old world, while requiring radically different mindsets. Below are three prominent examples of the shift.
Shift 1: “We know the lineage” to “We know what in god’s name is happening”
In the old world, if you understood data ancestry, you had a pretty good grasp of your data platform. Showing that “Business Action A comes from Dashboard B comes from View C comes from Table D comes from CRM API E” was no easy task, but it allowed you to identify broken points in the graph and address them before the next run of the daily batch job.
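To make the old-world picture concrete, here is a minimal sketch of that chain as a graph, walked downstream from a broken node. The asset names mirror the example above; the `LINEAGE` mapping and the `downstream_of` helper are hypothetical and not taken from any particular lineage tool.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets built directly from it.
LINEAGE = {
    "crm_api_e": ["table_d"],
    "table_d": ["view_c"],
    "view_c": ["dashboard_b"],
    "dashboard_b": ["business_action_a"],
    "business_action_a": [],
}


def downstream_of(asset, lineage):
    """Return every asset affected by `asset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


# If Table D breaks, everything built on it is suspect until the next batch run.
affected = downstream_of("table_d", LINEAGE)
print(affected)
```

A walk like this answers "what sits below the broken thing?", which was most of the question when everything ran in one daily batch.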
Nowadays, the platform teams face correlational problems as well:
From 10:00am to 10:30am, AWS us-east-1 had an outage. Snowflake was inaccessible during this time, affecting your data replication, observability, transformation, reverse ETL, and dashboard tooling. How do you identify and restart the necessary processes?
Understanding the impact of new data assets, of bad data assets, of system outages across a sprawling set of processes is hard, and resembles the work of site reliability engineers. And the pain gets more acute as you add more users, tools, and use cases to the platform — in other words, it scales with value.
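The outage question above is less a lineage walk than an interval-overlap query across every tool's run history. The sketch below assumes a hypothetical metadata store that records, for each run, which systems it depends on and when it started and finished. The `Run` record, the process names, and the date are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Run:
    """One execution of some process, as a hypothetical metadata store might record it."""
    process: str
    depends_on: set
    started: datetime
    finished: datetime


def runs_hit_by_outage(runs, system, outage_start, outage_end):
    """Return runs that depend on `system` and overlapped the outage window."""
    return [
        run for run in runs
        if system in run.depends_on
        # Two intervals overlap when each starts before the other ends.
        and run.started < outage_end
        and run.finished > outage_start
    ]


day = datetime(2022, 1, 1)  # arbitrary date, for illustration only
runs = [
    Run("replication", {"snowflake"}, day.replace(hour=9, minute=55), day.replace(hour=10, minute=5)),
    Run("dbt_transform", {"snowflake"}, day.replace(hour=10, minute=45), day.replace(hour=11)),
    Run("reverse_etl", {"snowflake", "crm"}, day.replace(hour=10, minute=20), day.replace(hour=10, minute=40)),
    Run("log_archive", {"s3"}, day.replace(hour=10), day.replace(hour=10, minute=30)),
]

to_restart = runs_hit_by_outage(
    runs, "snowflake",
    outage_start=day.replace(hour=10),
    outage_end=day.replace(hour=10, minute=30),
)
print([run.process for run in to_restart])
```

Note what the sketch takes for granted: a single place where runs from replication, transformation, and reverse ETL tools are all recorded in the same shape. It also misses runs that were scheduled inside the window but never started, which is exactly the sort of gap a platform team ends up owning.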
Shift 2: “We unblock analysts” to “We enable everyone”
Data finds a way, regardless of our best intentions. Data tucked neatly into Snowflake will show up in BI tools, of course — but also emails, Slack, CRMs, Retool apps, ML models, customer-facing products, product analytics tools, and whatever the hell “native data apps” turn out to be.
That’s a feature, not a bug, of data. And it’s the job of data engineers to make the sustainable way of using data and the easiest way of using data one and the same. That’s what dbt has done with “creating tables” and look at its effect: dbt projects become semantic gravity wells, agglomerating new tables for all sorts of use cases. It becomes barely-contained chaos where neophytes and elites brush shoulders — not unlike some of the world’s great cities.
Shift 3: “We debug our DAGs” to “We debug our data”
In Flint, Michigan, functional water pipes corroded and started leaching lead into the water, causing irreparable harm to children and irreversible damage to trust in the local government. In data, the stakes are nowhere near as high, but a poisoned watering hole causes similarly widespread damage to reputation and coordination.
All of which means, as the flow of data becomes a table stakes commodity to all sorts of operational processes, data engineers can’t sit back and just watch it flow. We need new frameworks, protocols, and processes to ensure that the data is consumable: that personal data is treated appropriately, that toxins from upstream aren’t leaching into the metrics, that subtle drifts in balance won’t ruin the machine learning models. A platform team may not be on the hook for solving each of these problems, but they certainly provide leadership, frameworks, and tools for diagnosing them early.
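As one small example of debugging the data rather than the DAG, here is a sketch of a check for drift in class balance between a baseline sample and a new batch. The function names, the sample labels, and the 0.10 threshold are all assumptions chosen for illustration; a real platform would pick its own metric and tolerance.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its share of the rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def balance_drift(baseline, current):
    """Largest absolute change in any label's share between two samples."""
    base, cur = class_shares(baseline), class_shares(current)
    labels = set(base) | set(cur)
    return max(abs(base.get(label, 0.0) - cur.get(label, 0.0)) for label in labels)


# Hypothetical samples: last month's training labels vs. today's batch.
baseline = ["churn"] * 20 + ["stay"] * 80
current = ["churn"] * 35 + ["stay"] * 65

THRESHOLD = 0.10  # arbitrary tolerance chosen for illustration
drift = balance_drift(baseline, current)
if drift > THRESHOLD:
    print(f"class balance drifted by {drift:.2f}; hold the batch for review")
```

The DAG that produced `current` can be perfectly green while this check fails, which is the whole point of the shift.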
Airflow as a Service as a Platform
So what is my problem with Airflow? My problem is that Airflow was not designed to address these problems: it lacks the ambition we need, even while occupying the critical pedestal as the foundational execution engine. We don’t need a better Airflow, but we need a higher-level one: a system that enables data platform teams to think at a platform level.
In fact, Airflow is already displaced. Airflow qua Airflow is already obsolete, and it happened right within the Airflow ecosystem. It’s called Astronomer.
Astronomer bills itself as the “Fully Managed Apache Airflow Service”. But it’s not. Astronomer is to Airflow as Snowflake is to the database. It’s a management system, and it shows us what the future of data engineering really looks like:
A top-level “control plane” that allows you to spin up an Airflow deployment in its own Kubernetes environment
Each Airflow deployment has health metrics visible through Grafana and Prometheus dashboards
Astronomer integrates at the organization level with identity management systems
The control plane can ingest metadata from across workspaces via a separate service
Developer efficiency tools, like integrated Github actions, secrets management, and simple dev/stage/prod promotion workflows
If it sounds like you could simply replace Airflow with basically any other job execution engine, that’s because you could. Astronomer’s value proposition lies not in the individual deployments, but in the control plane, and in (eventually) all of that beautiful metadata each of those engines will emit. In the Astronomer paradigm, Airflow’s greatest value is as a marketing tool.
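The "swap in any engine" claim can be sketched as an interface. What follows is not Astronomer's API or architecture. It is a hypothetical control plane that only knows its engines through a shared `health()` method, with made-up class names and made-up health values.

```python
from typing import Protocol


class ExecutionEngine(Protocol):
    """Anything that can report on its own deployment; Airflow is one option."""

    def health(self) -> dict: ...


class AirflowDeployment:
    def __init__(self, name):
        self.name = name

    def health(self):
        # A real deployment would query its scheduler; values here are made up.
        return {"deployment": self.name, "engine": "airflow", "healthy": True}


class CronBox:
    def __init__(self, name):
        self.name = name

    def health(self):
        return {"deployment": self.name, "engine": "cron", "healthy": False}


class ControlPlane:
    """Hypothetical control plane: it only sees the shared interface."""

    def __init__(self):
        self.engines = []

    def register(self, engine: ExecutionEngine):
        self.engines.append(engine)

    def unhealthy(self):
        reports = (engine.health() for engine in self.engines)
        return [report["deployment"] for report in reports if not report["healthy"]]


plane = ControlPlane()
plane.register(AirflowDeployment("growth-team"))
plane.register(CronBox("legacy-reports"))
print(plane.unhealthy())
```

Nothing in `ControlPlane` mentions Airflow. The value sits in the registry and the metadata it collects, not in any one engine behind it.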
dbt has proven the value of having a semantic abstraction layer on top of modern data platforms. But it’s not enough. There’s an area of the stack yet to be addressed — one which requires knowledge of Terraform, AWS, Kubernetes, CI/CD, networking, and security principles — that also needs to be simplified and adapted to the needs of data teams. The tool data engineers need to be effective in this new world does not run scripts, it organizes systems.
So Airflow… it’s not going to work out between us. But I realize now that it’s not you, it’s me. You were built for a world I’m not interested in living in, and I think we should go ahead and move in our separate directions. - Sincerely, Stephen
William Gibson’s statement that “The future is here, but it’s unevenly distributed” comes to mind.
This doesn’t preclude engineers from building workflows that make Airflow more modular, but there’s a clear contrast with the modular design of platforms like Prefect and Dagster.
Not unlike a certain slate of candidates in
It’s worth noting here that Dagster and Prefect have occupied this multi-domain ecosystem
I remember having this feeling a few years ago. What I realized is that airflow has taught us a few bad habits and also brought ahead an interesting paradigm of the vertical workflow engine.
I agree airflow is old, legacy and ideally folks should not use it, reality is there is a lot of pipelines already built with it - sadly. I think as a community we have to start moving away from it for more complicated problems.
Disclaimer: I created Flyte.org and heavily believe in decentralized development of DAGs and centralized management of infrastructure
context : I wrote https://towardsdatascience.com/apache-airflow-in-2022-10-rules-to-make-it-work-b5ed130a51ad
yes airflow is NOT an ETL tool, but a scheduling tool
yes airflow 1 was buggy and super slow
yes airflow 2.3 is still not 100% stable
we should never confuse the airflow-operators and airflow itself
so many OSS operators are shitty and run transformations directly in airflow itself (if not using the KubernetesExecutor or CeleryKubernetesExecutor)