Say what you want about the data space but it is never boring. Or at least it hasn’t been boring lately.
I remember how very green I was when I joined my first data team. I was excited to be given the opportunity because… well, it was boring and no one else wanted that job. These were the days when the scope of data teams was minimal, before data scientists were called the sexiest job of the 21st century. I remember reading that headline at the time and feeling confused. My day-to-day responsibilities were decidedly not sexy. Our main responsibility was to ensure everything ran in a predictable fashion with as little excitement as possible. But that was over a decade ago! Jumping forward to the end of 2022, I had almost forgotten those unsexy times.
The last few months have been tough for tech. Luckily, data teams have not been hit too hard (based on my entirely anecdotal evidence). But I can’t say the same about the data stack. Teams are abandoning many of the tools they had added. We have all found out that the complexity of the modern data stack does not mirror the maturity of the business or the number of people on the team so much as it reflects the economy.
From my conversations, it is interesting to see which tools have been making the cut. At a high level, it seems to have been a lot of last in, first out. Not many people are giving up on their data warehouse, orchestration, or ETL (turns out databases are recession-proof). But some of the newer tools have been sidelined or put on pause.
There are a few reasons some tools have not been able to differentiate themselves from critical components. First of all, the closer you get to the datastore, the harder it is to pivot. Once companies get their data in one place, they generally don’t like to move it somewhere else unless they have a very good reason. Even moving to similar technologies risks unintended changes and cascading issues that can add months of development time.
Databases are still the bread and butter of the data stack, the thing everything else revolves around. It follows that there is a lot less need for metadata analysis tooling without an actual database to point it at.
Also, data teams, not surprisingly, have a better understanding of the systems they have been dealing with for most of their careers. dbt’s biggest contribution to the community was not an especially elegant framework but a common language for people on small teams to communicate with other data folks working on similar projects. Everyone working in the industry before dbt had built their own version of it, or tried to. Those versions were rarely that successful, but analysts at least understood what they were trying to accomplish.
As the original responsibilities of data teams were reduced to “solve problems,” the community moved on to other challenges, especially as it became more insular. But here it became a little bit trickier. What should come next? Do you build a feature store? (I don’t mean to pick on this one example.) There are many legitimate use cases for a feature store, but not many smaller data teams have experience building one or needing one. Luckily, there was a growing number of offerings allowing you to jumpstart that journey. But jumping into the deep end of a complex project with a complex solution does not solve much of anything. Yes, it makes things easier to get started, but it quickly showcases the deficiencies of data teams.
A recent trait of the data community is to put a new name on an existing software practice and proclaim something new has just been discovered. As the goals of data applications became more ambitious and their scope expanded, there were more needs to support. For example, you now need someone or some tool to help with safe and consistent deployment. You might be thinking DevOps would fill that role, but that would be wrong. The correct answer is you need DataOps… or MLOps… or AIOps… the point is it never existed before, and it was up to the data team to develop this competency.
The increase in specialization on data teams, while ignoring expertise from existing, non-data teams, led to more and more pressure to layer in new, increasingly specific tooling. At its worst, data teams would just build parallel engineering orgs. This could help the agility of pure data applications but started to undermine the goal of more complicated applications.
In a lot of ways, the new tools added to the data stack were an excuse to not communicate outside of data. Data teams have always had an inconsistent home. Is it in Product, Eng, DevOps? But ambiguity is not a reason to start doubling up on responsibilities and tools. The truth is, if you treat data applications as entirely separate from the engineering applications on the critical path, those data applications will never be trusted outside of data.
Building any level of trust means going beyond yourself. It may mean holding the data team to some of the standards of engineering and not just doing whatever makes the data team feel the most comfortable. Not every member of the team needs to become a software engineer, but it would be good to try to learn some SWE habits, one of which is the importance of working within constraints. You may not always make the perfect decision, but you need to know enough to not paint yourself into a corner. Part of maintaining a system means you can’t rewrite it whenever you hit a wall.
So what will the next year look like in data tools? I think we are going to go back to behaving like boring practitioners who work within confines. We may have to use tools not entirely catered to us. There will still be analytics tools, but I think they will behave in ways that are recognizable to both data and engineering teams. This will allow a number of applications to be built on top of them and help break down silos.
We should encourage looking across teams to see what we can leverage. We can also try to return the favor. On data teams, I have often found the most successful tools draw interest from outside of data. When rolling out orchestration, the job was not done when the analytics jobs had been migrated but when engineers started to move their own jobs onto it. When data applications start to resemble any other application, you can build some really interesting things, including some things you didn’t intend at the beginning. You may also find that when something has the approval of the entire organization, it is much harder to cut when times get tough.
This is essay #1 in the DPE Symposium on Is the Orchestrator Dead or Alive?