Nobody Should Write ETL
Let our systems figure it out, while they still listen to us
This is essay #2 in the Symposium on “Is the Orchestrator Dead or Alive?” You can read more posts from Vinnie on his Substack.
We’ve already accepted that engineers shouldn’t write ETL. “It’s a hot potato,” we said, so let’s “give people end-to-end ownership of the work they produce.” Let Data Scientists write ETL, let Data Analysts ingest their own data. Engineers don’t have time for that; they’re busy writing vendor glue and fixing the SaaSpool known as the “modern data stack” after dbt published a breaking change in its most recent release, breaking all 37 connectors.[1]
I say this is a good step, but it’s a band-aid solution. All of the problems inherent in ETL workflows are still there; they’re just someone else’s responsibility now. Why don’t we just do away with it altogether?
Maybe I am taking the argument in the article too literally after the wine I was contractually obligated to drink for this symposium (thanks for fuelling my alcoholism, Stephen), but let’s run with it for a second. Let’s assume that the platform team is just keeping packages updated and writing more decorators we can all import into our scripts. We can not only @time our functions, we can also @count how many times they’re called, automatically @send_telemetry_to_infra_team and beg them to @accept_this_pr. I promise this time the code is
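For the record, the decorators in question tend to look something like the sketch below. It is a minimal illustration using only the standard library; the names `timed`, `counted`, `CALL_COUNTS`, and `load_events` are made up for this example (standing in for the @time and @count above), and the in-memory dictionary is an assumed stand-in for whatever telemetry backend the infra team actually runs.

```python
import functools
import time

CALL_COUNTS = {}  # hypothetical in-memory stand-in for a telemetry backend


def timed(fn):
    """Store the duration of the most recent call on the wrapper itself."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Recorded even if the wrapped function raises.
            wrapper.last_seconds = time.perf_counter() - start
    return wrapper


def counted(fn):
    """Count how many times the wrapped function has been called."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        CALL_COUNTS[fn.__name__] = CALL_COUNTS.get(fn.__name__, 0) + 1
        return fn(*args, **kwargs)
    return wrapper


@timed
@counted
def load_events():
    # Placeholder for the actual extract step every team ends up rewriting.
    return [1, 2, 3]
```

Every script that wants the timing and the counts has to import and apply these by hand, which is exactly the kind of busywork this essay is grumbling about.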
The Data Analytics team is happy, they can click here and there and get metrics from Snowplow into Snowflake. They don’t even have to model the data themselves, it’s already done.
And let the Data Science team do the same. End-to-end ownership means end-to-end silos. Those analysts could never begin to understand the sheer magnitude of import tensorflow as tf. Why use their data? It’s probably badly modeled anyway. We found our own, better package and can hook it up ourselves. Now if only the Airflow instance doesn’t crash when we push our new DAGs to prod.
Is it just me, or does this seem terribly inefficient? End-to-end ownership sounds great and the autonomy feels great, but even the author of the original piece says that “[they] are sacrificing technical efficiency for velocity and autonomy. It is important to recognize this as a deliberate trade-off.” It makes no mention of unnecessary duplication.
I guess I’ve been in one too many organizations where multiple teams were building the exact same thing without knowledge of one another (well, it’s only happened once, but I think it’s once too many) to accept this as a good solution. I may be too lost in abstractionland, but at a meta-level, most data work really isn’t that unique (for the sake of the argument, we’re not talking real-time or streaming processes), and as a matter of fact, it’s only the shape of the final exposure (yes, this could be another dbt pun) that really matters. Everything else going into it can, and should, be reused.
But if engineers aren’t writing ETL, and nobody really likes to do it, do we just accept this duplication? Do we just keep filling Snowflake’s pockets?
I think there’s a better way, and orchestration can be the answer. But not just any orchestration: asset-aware orchestration.
Let’s all just stop writing ETL altogether. Let’s declare the outputs of our pipelines and let our systems figure out the rest. If humans don’t wanna deal with it, let the machines do it, at least while we can still command them.[4]
We declare our data assets in code. Our data platform knows how to create and maintain them. The engineers already wrote fancy abstracted vendor glue anyway, so they might as well write a YAML config for it. The consumers, be it the data scientists or analysts, can stop importing the @fancy_new_decorator and instead import the source asset into their project. That asset is materialized once and always kept up to date. They tell the system their downstream use case depends on it, and then go to sleep. Or to a bar, or to the beach, and everyone lives happily ever after.
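To make "declare the outputs and let the system figure out the rest" a bit more concrete, here is a toy sketch in plain Python. It is not Dagster's API; the `asset` registry, `ASSETS`, `materialize_all`, and the example asset names are all hypothetical, and the only assumption about the runtime is Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy registry; NOT Dagster's API, only an illustration.
ASSETS = {}  # asset name -> (compute function, names of upstream assets)


def asset(deps=()):
    """Register a function as a declared data asset with its upstream dependencies."""
    def register(fn):
        ASSETS[fn.__name__] = (fn, tuple(deps))
        return fn
    return register


@asset()
def raw_events():
    # Source asset, declared once by whichever team owns it.
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]


@asset(deps=["raw_events"])
def clicks_per_user(raw_events):
    # An analyst's downstream asset; it only names what it depends on.
    return {row["user"]: row["clicks"] for row in raw_events}


@asset(deps=["raw_events"])
def total_clicks(raw_events):
    # A data scientist's asset reusing the same source instead of re-ingesting it.
    return sum(row["clicks"] for row in raw_events)


def materialize_all():
    """Compute every declared asset exactly once, upstream assets first."""
    graph = {name: set(deps) for name, (_, deps) in ASSETS.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = ASSETS[name]
        results[name] = fn(*(results[d] for d in deps))
    return results
```

Calling `materialize_all()` computes `raw_events` once and hands the result to both consumers. Nobody wrote a pipeline; they only said what they wanted and what it depends on.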
Maybe the very notion of ETL is wrong. In 2023, our data warehouse is not the final destination, it’s just another step in the journey of the data. ETL, and its cousin Reverse ETL (AKA “ETL”[::-1]) were helpful as concepts a few years ago, but it’s time we started thinking about processes differently.
The world I wanna live in is one in which some team, be it the engineers or the data producers themselves, declares source assets, including metadata and contracts about their shape. Consumers can hook up to them, and the system can intelligently tell where they’re saved and how often they’re updated, and takes care of keeping all downstream dependencies up to date. Maybe this whole article is just about solving the technical part of “data mesh”,[5] but when we’re still trying to hold our Data Scientists’ hands (they might understand import tensorflow as tf, but they would never understand the Liskov Substitution Principle),[6] we might as well go all the way and take back ownership of the sources. We all know they weren’t writing unit tests for them anyway.
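A rough sketch of what such a declaration could carry, again in plain Python rather than any real orchestrator's API. The `SourceAsset` and `DerivedAsset` classes, the `warehouse.raw.events` location, and the version-counter approach to staleness are all assumptions made for illustration; a real system would track freshness and contracts in its own way.

```python
from dataclasses import dataclass, field


@dataclass
class SourceAsset:
    """Hypothetical declaration of a source: where it lives, its contract, its cadence."""
    name: str
    location: str         # assumed storage path or table name
    contract: dict        # column name -> expected Python type
    refresh_minutes: int  # how often the producer promises to update it
    version: int = 0      # bumped on every successful refresh
    rows: list = field(default_factory=list)

    def refresh(self, new_rows):
        # Enforce the contract before consumers ever see the data.
        for row in new_rows:
            for column, expected in self.contract.items():
                if not isinstance(row.get(column), expected):
                    raise ValueError(f"{self.name}: bad {column!r} in {row}")
        self.rows = list(new_rows)
        self.version += 1


@dataclass
class DerivedAsset:
    """A consumer's asset that remembers which upstream version it was built from."""
    name: str
    upstream: SourceAsset
    built_from: int = -1
    value: object = None

    def is_stale(self):
        return self.built_from != self.upstream.version

    def reconcile(self, compute):
        # Rebuild only when the upstream changed since the last build.
        if self.is_stale():
            self.value = compute(self.upstream.rows)
            self.built_from = self.upstream.version
            return True
        return False


events = SourceAsset("events", "warehouse.raw.events", {"user": str, "clicks": int}, 60)
total = DerivedAsset("total_clicks", events)
events.refresh([{"user": "a", "clicks": 3}])
total.reconcile(lambda rows: sum(r["clicks"] for r in rows))
```

The consumer never schedules anything. It states its dependency, and the system decides when a rebuild is needed; a row that violates the contract is rejected at the source instead of surfacing three dashboards later.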
And yes, this whole article is just another way for me to shill for Dagster’s software-defined assets and declarative scheduling. Guilty as charged.
[1] Any similarity with related incidents in your organization is purely coincidental.
[2] Or “Stitches,” one of the many unanswered questions in the modern data stack.
[3] Which may or may not have been updated since they first launched it in 2017.
[4] Yes, I am scared about GPT-4.
[5] Which happens to be the easier part, of course.
[6] Not that I have, to be fair.