Good data engineers are lazy

Airflow's neighborhood must be razed

Mar 30, 2023

This is essay #4 in the Symposium on Is the Orchestrator Dead or Alive? You can read more posts from Benoit on his Substack, From An Engineer Sight.

The goal of engineers is to be as lazy as possible. If your system doesn't allow you to sit and wait for completion, you're missing something.

That was probably the day I knew I wanted to be an engineer.

The IT teacher was writing some wizardry on a very desaturated terminal projected on a big white wall. He said this sentence just as I was about to get a "wow effect" from one of my first Python code snippets. It’s now carved in my memory.

Though he was being a bit sarcastic, I really think he highlighted something that day.

What's our job as data engineers, if not to reach end-to-end automation?

I once wrote that we don't need orchestrators. We do need orchestrators, but not ones that end up in spaghetti, not ones that need Italian data engineer chefs to maintain their codebase.

The orchestrator is often our best friend when speaking about automation, but I think we still lack something. We are to a certain extent just a replacement for old BI tools... We should drive greater value.

Data engineers are not used to their full potential. Writing the same Python code over and over, dealing with the same data issues, migrating systems to new ones, etc. This is not really an engineering job, is it?

The outcomes we should look for are automation at its full potential, laziness at its climax, and optimized engineering costs. Our main duty should be to design architecture, to build systems that allow us to be lazy, not to manage all the stuff we have built.

How to clean up our city?

As Antoine de Saint-Exupéry said, "a designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."

We definitely missed something with the modern data stack. Unbundling and bundling cycles, some would say. Or we just got blinded by marketing and hoped that our problem could be solved by a myriad of new tools.

The orchestrator’s neighborhood is not a pretty sight. It’s like a row of gray buildings growing wherever they can, while one wonderful house keeps standing in the middle of the street. There are shadows on this house. (There are actually several houses.)

What we might just want is to bring some sunshine back to the city. To remove some of those ugly buildings and keep the ones that really support us. And why not at the same time raze some big houses nobody really wants to buy anymore?

Yes, the Airflow house is great. I loved Airflow’s garden, like any front-end developer who loved jQuery lighting at some point. But now the house is too big to keep all our furniture and decorations tidy.

“Airflow was not built for infinitely-running event-based workflows”.

Playing with words here, but even if that myriad of tools didn’t replace Airflow, they still highlighted something: we need a new central control plane to deal with our event-based reality and its infinitely running data flows.

Fortunately, we are starting to catch up.

Yes, we have and use better tools: dbt, new declarative orchestrators, Terraform, DuckDB, CI/CD, etc. The declarative paradigm is nudging every part of the data stack.
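To make "declarative" a bit more concrete, here is a minimal sketch, assuming Python 3.9+ and only the standard-library `graphlib` module. The task names and the `execution_order` helper are hypothetical and not taken from any of the tools above. The point is that we only state what depends on what, and let the engine work out the order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline spec: each task lists the tasks it depends on.
# Nothing here says *when* or *in which order* things should run.
PIPELINE = {
    "extract_orders": [],
    "extract_customers": [],
    "join_orders_customers": ["extract_orders", "extract_customers"],
    "publish_report": ["join_orders_customers"],
}


def execution_order(spec):
    """Derive one valid run order from the declared dependencies."""
    # TopologicalSorter takes a mapping of node -> predecessors and
    # yields every node after all of its predecessors.
    return list(TopologicalSorter(spec).static_order())


if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        print(step)
```

Adding a task means adding one entry to the spec. The run order is recomputed rather than hand-maintained, which is the lazy part.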

But as disruptive as those new tools are, we still need vision, automation, and architecture design. We need the mayor to wake up and stop wondering if building a new pool will allow him to be re-elected.

Tools and codebases are often the elements of the debate. And while they support our concrete daily work, we rarely take the same amount of time to think about the underlying system architecture they call into question and the legacy they will create.

One key to uncovering some vision and paving the way toward persistent roads is to ask ourselves probing questions:

Should my orchestrator be in the central space of my data stack? Does it fit with my business needs?
Do I really need to pay for a tool to extract data from one place to another?
Should I ask pricy engineers to write the same low-value Python code over and over?
What's my codebase vision? Does my codebase need a vision? Or can I trash some code easily?
Is maintaining custom abstraction on top of abstraction a good idea (we often see custom Airflow code in companies, but we have to remember that Airflow is already a layer of abstraction)?
Do my high-level execs understand the game we are playing?
Do my data analysts focus on business intelligence while understanding the need for good software engineering practices?
Do they consider themselves "coders" even if their tools tell them to drag and click? [1]

Again, no straight answers. Those questions are only here to suggest a way. To bring forward that idea, that small trick we are looking for while solving our technological and organizational issues.

Like a city, a data system is not something neat. It's growing and changing constantly.

Our orchestrator is like the city's traffic controller, responsible for coordinating the movement of data through the city's streets and ensuring that everything is running smoothly. But we have to be at the city council level, where we create a vision for our city's future and design the infrastructure to support it.

And like any city council, we have to be lazy. Don’t get me wrong here, it’s being lazy in the right way: we have to be bored in advance by the idea of dealing with angry citizens, solving traffic jams, finding resources from the government, etc.

There will always be unforeseen challenges, changes in the environment, and new technologies to integrate. And while the traffic controller has a major role to play here, it's our role to build solid architecture and a clear vision to guide our entire city's development.

[1] Every click, every drag is translated to code at some point. Declarative isn't just a trend, it's the realization that consistency and efficiency can't be achieved without a proper DSL. Drag & drop isn't bad at all; on the contrary, it allows speed and a great user experience. But it fails at automation and consistency. The declarative paradigm somewhat tackles this issue.

A guest post by

Benoit Pimpaud

Writing From An Engineer Sight, a periodic about data, engineering and design. https://fromanengineersight.substack.com/

Data People Etc.