Airflow's Problem
I remember having this feeling a few years ago. What I realized is that Airflow has taught us a few bad habits and has also brought forward an interesting paradigm: the vertical workflow engine.
I agree that Airflow is old and legacy, and ideally folks should not use it, but the reality is that a lot of pipelines are already built with it, sadly. I think that as a community we have to start moving away from it for more complicated problems.
Disclaimer: I created Flyte.org and strongly believe in decentralized development of DAGs and centralized management of infrastructure.
Context: I wrote https://towardsdatascience.com/apache-airflow-in-2022-10-rules-to-make-it-work-b5ed130a51ad
yes, Airflow is NOT an ETL tool, but a scheduling tool
yes, Airflow 1 was buggy and super slow
yes, Airflow 2.3 is still not 100% stable
but
we should never confuse the Airflow operators with Airflow itself
so many OSS operators are poorly written and run transformations directly inside Airflow itself (unless you use the KubernetesExecutor or CeleryKubernetesExecutor)
You should probably look into Flyte as well, as a remedy for all the Airflow-esque problems.
In the end, I still don't understand which features the author missed: dynamic DAGs, metadata management, data quality?
This post got a lot of attention! I would encourage all readers to check out the conversation on Hacker News, which has a lot of great insights: https://news.ycombinator.com/item?id=32317558
The first three of the mentioned problems can be solved with the official Airflow Helm chart: https://airflow.apache.org/docs/helm-chart/. The fourth one (the control plane can ingest metadata from across workspaces via a separate service) I did not understand, to be honest, but there is an API to change connections, variables, etc. Yes, there are not enough developer tools, but if the rest of the system is designed to have Airflow as a scheduler, it should not be a problem to do CI/CD; for example: https://medium.com/@FunCorp/practical-guide-to-create-a-two-layered-recommendation-system-5486b42f9f63 (disclaimer: I'm the author of the article).
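To make the point about the API a bit more concrete, here is a rough sketch of how a CI/CD step might update an Airflow Variable through the stable REST API in Airflow 2. It only builds the request and does not send it. The host, the variable name `target_env`, the credentials, and the helper name `build_variable_update` are all made up for illustration, and it assumes the basic-auth API backend is enabled on the webserver, which may not match your deployment.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_variable_update(base_url: str, key: str, value: str,
                          user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that updates one Airflow Variable.

    Hypothetical helper: assumes the Airflow 2 stable REST API under /api/v1
    and that basic authentication is enabled for the API.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"key": key, "value": value}).encode()
    # Quote the key so that unusual characters cannot break the URL path.
    path = f"/api/v1/variables/{urllib.parse.quote(key, safe='')}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Placeholder host and credentials, not a real deployment.
    req = build_variable_update("http://localhost:8080", "target_env",
                                "staging", "admin", "admin")
    print(req.get_method(), req.full_url)
    # To actually apply the change: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending it makes the step easy to dry-run in a pipeline before pointing it at a real Airflow instance.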