I remember having this feeling a few years ago. What I realized is that Airflow has taught us a few bad habits, but it also brought forward an interesting paradigm: the vertical workflow engine.
I agree Airflow is old and legacy, and ideally folks should not use it; the reality, sadly, is that a lot of pipelines are already built with it. I think as a community we have to start moving away from it for the more complicated problems.
Disclaimer: I created Flyte.org and strongly believe in decentralized development of DAGs and centralized management of infrastructure.
Agree! I enjoyed your episode on the DE podcast and your journey of beginning with Airflow, then needing to build a new orchestrator to tackle a greater variety of use cases.
I used Argo Workflows at my previous job, and the k8s job abstraction is a very clean one, which I think is why any team that needs to introduce machine learning into the stack starts to look away from Airflow. I haven't used Flyte but may give it a shot at some point. (Currently using Dagster)
Episode for those interested: https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/
Thank you for the kind comments. I would definitely love to connect and read more of your stories. The rawness of the frustration is genuine and refreshing 🙏
Stephen, this was a fantastic read! I wholeheartedly agree with your candid reflections on Airflow's limitations and the need for platforms that support faster, simpler, and more decentralized data workflows. Your insights on the evolving role of data engineers truly resonate with the current shift towards enabling broader access and agility in data operations.
I also wanted to highlight a solution from our side at dlt, which you might find interesting given your quest for more flexible and less cumbersome tools. In our blog post https://dlthub.com/blog/first-data-warehouse, we delve into the benefits of using the dlt library for building custom pipelines. Unlike the traditional heavy, centralized orchestrators like Airflow, dlt offers a more lightweight, Pythonic approach to data ingestion and transformation, promoting database agnosticism and facilitating easier migrations. This aligns well with the decentralized and autonomous data platform philosophy you advocate for.
Looking forward to more of your insightful posts!
Best,
Aman Gupta,
DLT Team
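For readers curious what such a dlt pipeline looks like in practice, here is a minimal sketch, assuming the dlt library is installed and DuckDB is used as a stand-in destination; the pipeline name, dataset, and data are illustrative, not from the post:

```python
# Minimal sketch of a dlt pipeline; pipeline name, dataset, and rows are hypothetical.
import dlt

# Illustrative in-memory data; in practice this would come from an API or database.
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

pipeline = dlt.pipeline(
    pipeline_name="example_ingest",  # hypothetical name
    destination="duckdb",            # assumption: DuckDB as a local destination
    dataset_name="raw_data",
)

# dlt infers the schema and loads the rows into the destination table.
load_info = pipeline.run(rows, table_name="users")
print(load_info)
```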
Context: I wrote https://towardsdatascience.com/apache-airflow-in-2022-10-rules-to-make-it-work-b5ed130a51ad
Yes, Airflow is NOT an ETL tool but a scheduling tool.
Yes, Airflow 1 was buggy and super slow.
Yes, Airflow 2.3 is still not 100% stable.
But:
We should never confuse the Airflow operators with Airflow itself.
So many OSS operators are shitty, and so many teams run transformations directly inside Airflow itself (if they are not using the KubernetesExecutor or CeleryKubernetesExecutor); a quick sketch of delegating the work to Kubernetes instead follows below.
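To make the distinction concrete, here is a minimal sketch of a DAG that keeps Airflow as a pure scheduler and pushes the actual transformation into its own Kubernetes pod via the KubernetesPodOperator; the image and entrypoint are assumptions, not from the post:

```python
# Minimal sketch: Airflow only schedules; the transformation runs in its own pod.
# Assumes the apache-airflow-providers-cncf-kubernetes provider is installed;
# in newer provider versions the import path is ...operators.pod instead.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="transform_in_k8s",            # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="run_transform",
        name="run-transform",
        image="my-registry/transform-job:latest",  # assumption: your own image
        cmds=["python", "-m", "jobs.transform"],   # assumption: your own entrypoint
        get_logs=True,
    )
```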
You should probably look into Flyte as well, as a remedy to all the Airflow-esque problems.
In the end, I still don't understand which features the author missed: dynamic DAGs, metadata management, data quality?
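On the dynamic DAGs point, it is worth noting that Airflow 2.3 added dynamic task mapping, which expands tasks at runtime. A minimal sketch, with hypothetical task and file names:

```python
# Minimal sketch of Airflow 2.3+ dynamic task mapping; file names are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def dynamic_mapping_example():
    @task
    def list_files():
        # In practice this might list objects in a bucket.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path: str):
        print(f"processing {path}")

    # One mapped task instance is created per file at runtime.
    process.expand(path=list_files())

dynamic_mapping_example()
```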
I think what's missing from the Airflow project, and what is getting picked up by other orchestrators, is two things:
1. An opinionated way of deploying the product that makes the data platform easier to manage and easier to scale. I do think there are opinions here (e.g. "use only K8s Job operators"), but in Airflow's world, making that seamless is the user's problem. I think Dagster is doing this really well, and because they are opinionated, Dagster Cloud is actually positioned to take over all this work -- "just give us your code and config, and we handle the rest."
2. Some sort of "data asset" object to attach metadata to. If you look at all the machine learning orchestrators that have spun up over the past few years, they are basically all schedulers plus an opinionated model of the things that get produced; Pachyderm and dbt are also good examples of this. I am less certain this is a necessary feature of the modern orchestrator, but it does seem to be the way other projects are moving (a brief sketch of the idea follows below).
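As a concrete illustration of the second point, here is a minimal sketch of Dagster's "data asset" model, with hypothetical asset names and logic:

```python
# Minimal sketch of Dagster's asset model; asset names and logic are hypothetical.
import pandas as pd
from dagster import asset

@asset
def raw_orders() -> pd.DataFrame:
    # In practice this would pull from a source system.
    return pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})

@asset
def daily_revenue(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Declaring raw_orders as a parameter makes it an upstream asset,
    # so lineage and metadata come from the asset graph itself.
    return raw_orders.groupby("order_id", as_index=False)["amount"].sum()
```

The orchestrator then materializes assets rather than running opaque tasks, which is the shift described above.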
This post got a lot of attention! I would encourage all readers to check out the conversation on Hacker News, which has a lot of great insights: https://news.ycombinator.com/item?id=32317558
The first three of the problems mentioned can be solved with the official Airflow Helm chart https://airflow.apache.org/docs/helm-chart/. The fourth one ("The control plane can ingest metadata from across workspaces via a separate service") I did not understand, to be honest, but there is an API to change connections, variables, etc. Yes, there are not enough developer tools, but if the rest of the system was designed to use Airflow as a scheduler, it should not be a problem to do CI/CD; see for example https://medium.com/@FunCorp/practical-guide-to-create-a-two-layered-recommendation-system-5486b42f9f63 (disclaimer: I'm the author of that article).
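For reference, here is a minimal sketch of the kind of API call referred to above, creating a Variable via Airflow 2's stable REST API; the host, credentials, and values are hypothetical:

```python
# Minimal sketch: create an Airflow Variable via the stable REST API (Airflow 2.x).
# Host, credentials, and variable contents are hypothetical placeholders.
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"  # assumption: local webserver
AUTH = ("admin", "admin")                     # assumption: basic auth is enabled

resp = requests.post(
    f"{AIRFLOW_URL}/variables",
    json={"key": "s3_bucket", "value": "my-data-bucket"},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```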
Yeah, the issue here is not that they can't be solved. It's that having to solve those problems (for example, by using the KubernetesPodOperator instead of the PythonOperator) puts the user in a different position than Airflow originally intended, because now, to run the platform, I have to know Terraform, Helm, Kubernetes, AND the DAGs themselves.
Great article, by the way. Read it twice and got some ideas for our own ML stack.