12 Comments
Aug 2, 2022 · Liked by Stephen Bailey

I remember having this feeling a few years ago. What I realized is that Airflow taught us a few bad habits, but it also introduced an interesting paradigm: the vertical workflow engine.

I agree that Airflow is old, legacy software that ideally folks should not use; the reality, sadly, is that a lot of pipelines are already built with it. I think that, as a community, we have to start moving away from it for more complicated problems.

Disclaimer: I created Flyte.org and strongly believe in decentralized development of DAGs and centralized management of infrastructure.

author

Agree! I enjoyed your episode on the DE podcast and hearing about your journey: beginning with Airflow, then needing to build a new orchestrator to tackle a greater variety of use cases.

I used Argo Workflows at my previous job, and the k8s Job abstraction is a very clean one, which I think is why any team that needs to introduce machine learning into the stack starts to look away from Airflow. I haven't used Flyte but may give it a shot at some point. (Currently using Dagster.)

Episode for those interested: https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/
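
For readers who haven't worked with it, here is a minimal sketch of that k8s Job abstraction, using the official kubernetes Python client; the job name, image, and namespace are illustrative, not from the discussion above:

```python
# Minimal sketch of a Kubernetes Job, submitted via the official Python
# client (pip install kubernetes). All names and images are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry the pod up to twice on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="train",
                        image="my-registry/train:latest",  # hypothetical image
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

The appeal is that every task is just a container with declared resources, so the orchestrator doesn't need to know anything about the code running inside it.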


Thank you for the kind comments. I would definitely love to connect and read more of your stories. The rawness in the frustration is genuine and refreshing 🙏

Aug 29 · Liked by Stephen Bailey

Stephen, this was a fantastic read! I wholeheartedly agree with your candid reflections on Airflow's limitations and the need for platforms that support faster, simpler, and more decentralized data workflows. Your insights on the evolving role of data engineers truly resonate with the current shift towards enabling broader access and agility in data operations.

I also wanted to highlight a solution from our side at dlt, which you might find interesting given your quest for more flexible and less cumbersome tools. In our blog post (https://dlthub.com/blog/first-data-warehouse), we delve into the benefits of using the dlt library for building custom pipelines. Unlike traditional heavy, centralized orchestrators like Airflow, dlt offers a more lightweight, Pythonic approach to data ingestion and transformation, promoting database agnosticism and facilitating easier migrations. This aligns well with the decentralized and autonomous data platform philosophy you advocate for.
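
For a flavor of the API, a minimal sketch (pipeline, dataset, and table names are illustrative):

```python
# Minimal dlt pipeline sketch (pip install "dlt[duckdb]").
import dlt

# Any iterable of dicts works as a source; dlt infers and evolves the schema.
rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]

pipeline = dlt.pipeline(
    pipeline_name="first_warehouse",  # hypothetical name
    destination="duckdb",             # swappable: bigquery, snowflake, postgres, ...
    dataset_name="raw",
)

load_info = pipeline.run(rows, table_name="users")
print(load_info)  # summary of what was loaded where
```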

Looking forward to more of your insightful posts!

Best,

Aman Gupta,

DLT Team

Aug 10, 2022 · edited Aug 10, 2022 · Liked by Stephen Bailey

Context: I wrote https://towardsdatascience.com/apache-airflow-in-2022-10-rules-to-make-it-work-b5ed130a51ad

Yes, Airflow is NOT an ETL tool but a scheduling tool.

Yes, Airflow 1 was buggy and super slow.

Yes, Airflow 2.3 is still not 100% stable.

But:

We should never confuse the Airflow operators with Airflow itself.

So many OSS operators are shitty, and they run transformations directly in Airflow itself (if not using the KubernetesExecutor or CeleryKubernetesExecutor).
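
To make that anti-pattern concrete, here is a sketch of a transformation executed inside the Airflow worker process itself, so the scheduler's infrastructure has to be sized for the data; DAG id, schedule, and paths are illustrative:

```python
# Anti-pattern sketch: heavy transformation running on the Airflow worker.
# DAG id, schedule, and S3 paths are hypothetical.
import pandas as pd
import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval="@daily",
     start_date=pendulum.datetime(2022, 8, 1),
     catchup=False)
def transform_in_worker():
    @task
    def transform():
        # This load/groupby happens in the worker process, so memory and CPU
        # for the workload must be provisioned on Airflow itself.
        df = pd.read_parquet("s3://bucket/events.parquet")
        agg = df.groupby("user_id").size().to_frame("events")
        agg.to_parquet("s3://bucket/agg.parquet")
    transform()

transform_in_worker()
```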

Aug 1, 2022 · edited Aug 7, 2022 · Liked by Stephen Bailey

You should probably look into Flyte as well, as a remedy to all the Airflow-esque problems.
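
For readers unfamiliar with it, a minimal flytekit sketch might look like this; the task and workflow names and logic are illustrative:

```python
# Minimal Flyte sketch (pip install flytekit). Each task runs in its own
# container on the platform; names here are hypothetical.
from typing import List

from flytekit import task, workflow

@task
def ingest(n: int) -> List[int]:
    # Tasks are strongly typed, which lets Flyte validate the DAG up front.
    return list(range(n))

@task
def total(xs: List[int]) -> int:
    return sum(xs)

@workflow
def pipeline(n: int = 10) -> int:
    return total(xs=ingest(n=n))
```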


In the end, I still don't understand which features the author missed: dynamic DAGs, metadata management, data quality?

author

I think what's missing from the Airflow project, and what other orchestrators are picking up, is two things:

1. An opinionated way of deploying the product that makes the data platform easier to manage and easier to scale. I do think there are opinions here (e.g., "use only K8s Job operators"), but in Airflow's world, making that seamless is the user's problem. I think Dagster is doing this really well, and because they are opinionated, Dagster Cloud is actually positioned to take over all this work: "just give us your code and config, and we handle the rest."

2. Some sort of "data asset" object to attach metadata to. If you look at all the machine learning orchestrators that have spun out the past few years, they are basically all schedulers plus an opinionated model of the things that get produced. Pachyderm and dbt are also good examples of this. I am less certain this is a necessary feature of the modern orchestrator, but it does seem to be the way other projects are moving.
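
For a concrete flavor of that "data asset" idea, here is a minimal sketch using Dagster's software-defined assets; the asset names and data are illustrative:

```python
# Sketch of Dagster's software-defined assets (pip install dagster).
# Asset names and contents are hypothetical.
import pandas as pd
from dagster import asset

@asset
def raw_users() -> pd.DataFrame:
    # The orchestrator records this as a named asset, not an anonymous task.
    return pd.DataFrame({"id": [1, 2], "name": ["ada", "grace"]})

@asset
def active_users(raw_users: pd.DataFrame) -> pd.DataFrame:
    # Declaring raw_users as an input wires lineage between the two assets.
    return raw_users[raw_users["id"] > 1]
```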

author

This post got a lot of attention! I would encourage all readers to check out the conversation on Hacker News, which has a lot of great insights: https://news.ycombinator.com/item?id=32317558


The first three of the problems mentioned can be solved with the official Airflow Helm chart: https://airflow.apache.org/docs/helm-chart/. The fourth one ("The control plane can ingest metadata from across workspaces via a separate service") I did not understand, tbh, but there is an API to change connections, variables, etc. Yes, there are not enough developer tools, but if the rest of the system was designed to have Airflow as the scheduler, it should not be a problem to set up CI/CD. For example: https://medium.com/@FunCorp/practical-guide-to-create-a-two-layered-recommendation-system-5486b42f9f63 (disclaimer: I'm the author of the article)
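
For instance, creating a variable through the Airflow 2 stable REST API might look like this; the host, credentials, and variable are illustrative, and this assumes the basic-auth backend is enabled:

```python
# Sketch of creating an Airflow variable via the stable REST API (Airflow 2.x).
# Host, credentials, and the variable itself are hypothetical.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")  # assumes the basic-auth backend is enabled

resp = requests.post(
    f"{AIRFLOW_API}/variables",
    json={"key": "feature_flag", "value": "on"},
    auth=AUTH,
)
resp.raise_for_status()
# An existing variable can be updated with PATCH /variables/{key};
# connections are managed the same way under /connections.
```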

author
Aug 6, 2022 · edited Aug 6, 2022 · Author

Yeah, the issue here is not that they can't be solved. It's that having to solve those problems (for example, by using Kubernetes Pod operators vs. Python operators) puts the user in a different position than Airflow originally intended, because now, to run the platform, I have to know Terraform, Helm, Kubernetes, AND the DAGs themselves.

Great article, by the way. I read it twice and got some ideas for our own ML stack.
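
To illustrate the trade-off above: even a minimal KubernetesPodOperator task drags cluster details into the DAG file itself. A sketch, with namespace, image, and DAG id being hypothetical:

```python
# Sketch of the trade-off: the KubernetesPodOperator keeps work off the
# Airflow workers, but the DAG author must now speak Kubernetes.
import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="pod_per_task",
    schedule_interval="@daily",
    start_date=pendulum.datetime(2022, 8, 1),
    catchup=False,
):
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        namespace="data-jobs",                 # cluster knowledge leaks in here
        image="my-registry/transform:latest",  # ...and container builds here
        cmds=["python", "transform.py"],
        get_logs=True,
    )
```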
