Every once in a while, I check to see if anyone knows what a Data Product is yet. This last time, after getting ChatGPT to admit that an AI-powered Astrological Fitness app would appropriately be classified as a “data product,” I thought I’d revisit some of the data mesh literature. That’s how I stumbled on NextData’s slogan: “We do for data what containers and web APIs did for software.”
Sounds great! Docker and APIs make it easy to compose software services and publish them world-wide. Having a standard process for packaging and deploying data products would make life easier for everyone who relies on them.
But wait — what makes a data product different from a regular product, again?
It certainly isn’t the presence or absence of data. An application without its data is a dehydrated skeleton of itself. It isn’t that software engineers lack some particular data skillset. Software engineers built the databases that manage data products.
Actually, in the Data Mesh book, Zhamak addresses this point (emphasis mine):
The difference between data product ownership and other types of products lies in the unbounded nature of data use cases, how particular data can be combined with other data and ultimately turned into insights and actions. At any point in time, data product owners are aware or can plan for what is known today as viable use cases of their data. At the same time, there remains a large portion of unknown future use cases for the data produced today, perhaps beyond their imagination.
Data products, then, are software products with fewer boundaries: open building blocks for others to access. In contrast to an application’s transactional database intended to serve only itself, a “data product database” might allow dozens of other use cases on top of it, some of which we can’t even plan for.
I can imagine at least one use: for making more data products, whole seething swathes of data products, dashboard graveyards, spaghetti DAGs, data swamps—new types of data morasses beyond our imagination. If there’s one thing data is good for, it’s for making more data.
Don’t get me wrong. I’m a big fan of data mesh. I loved the book, bought the shirt, and split my house into domains. I do think decentralizing data, providing self-serve infrastructure, baking in computational governance, etc., is the way.
I’m getting old, though, and cranky. Not only will the 2025 Machine Learning, AI, and Data Landscape feature over seven thousand vendors, but great minds are writing manifestos about breaking down data systems even further. Now that I can pitch an open table format, I must learn about open intermediate representations for query compilers. Throw in AI services — data products? — and it’s clear the landscape is getting weirder, not clearer.
What does it mean to containerize a Kafka topic, a CSV, a dashboard, a chatbot, or an API endpoint in the same way? Yes, we can index them, layer on consistent metadata, and add an “export to CSV” button, but this seems fundamentally different from what Docker does for us. It’s more like domain name registration or data cataloging; it solves for discoverability more than for development friction.
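To make that contrast concrete, here is a minimal sketch of what the “same way” tends to reduce to in practice. Every name below is hypothetical, invented for illustration, not taken from any real catalog tool: the one operation that generalizes across all five of those things is registering a metadata record.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: if "containerizing" a Kafka topic, a CSV, a
# dashboard, a chatbot, and an API endpoint all reduce to the same
# operation, the shared layer is a metadata record, not a runtime.
@dataclass
class DataProductManifest:
    name: str    # human-friendly identifier
    kind: str    # "kafka-topic" | "csv" | "dashboard" | "chatbot" | "api"
    owner: str   # accountable domain team
    uri: str     # where the thing actually lives
    tags: list = field(default_factory=list)  # discoverability hooks

# "Deploying" any of the five is the same call: register it in a catalog.
catalog = {}

def publish(product: DataProductManifest) -> None:
    catalog[product.name] = product

publish(DataProductManifest(
    name="orders-stream",
    kind="kafka-topic",
    owner="commerce-team",
    uri="kafka://broker:9092/orders",
    tags=["orders", "realtime"],
))
```

Notice what’s missing: no standardized runtime, no isolation, no reproducible build. Docker gives you all three; a manifest like this gives you a listing.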
In this blurry, software-eaten world, we are all generating production data all the time. A Google Sheet that can run arbitrary Python code but is managed by a 23-year-old accountant is a software service. A Streamlit app that does the same thing is a software service. And yes, so is a Django, Spring Boot, or iOS app. Dashboards and spreadsheets are a distinct class of software, yes, but the data they generate is still enmeshed in them. Even the digital assets — Slack, email, slide decks — that drive the business forward are software. We are living in prod.
I don’t know; I don’t know. Maybe it makes sense to carve out data as separate products. Maybe it makes sense to build out separate development lifecycles for each sub-specialty: analytics, machine learning, streaming data, transactions, REST, RAG, search. However, a genuinely radical view of “decentralized data management” would extend to the whole of the Internet: everything resolvable by a URL is a data product.
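Taken at face value, that radical view fits in a few lines. Here’s a toy sketch using only the Python standard library (the function name and record shape are my own, purely illustrative):

```python
import urllib.request
from datetime import datetime, timezone

def consume(url: str) -> dict:
    """Treat anything resolvable by a URL as a data product: fetch it,
    record the provenance, and leave trust decisions to the consumer."""
    with urllib.request.urlopen(url) as resp:
        return {
            "url": url,
            "content_type": resp.headers.get("Content-Type", ""),
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "body": resp.read(),
        }

# A blog post, a cat video, an empty slide deck: same interface.
product = consume("https://example.com/")
```

There’s no packaging step and no registry here; the consumer’s skepticism does all the work, which is exactly the trade described next.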
That would mean we don’t need new deployment processes but more skepticism and a tolerance for weirdness. We deal with this on the public Internet: we figure out what we trust and what we don’t, and it’s all pretty terrible, but we get along alright in the end.
It’s a pessimistic view, I admit. But is it so hard to believe that in the future, the piece of information that tips an AI decision-maker one way or another really could come from anywhere? This post. A cat video. An empty slide deck. That, at the end of all our best efforts to organize the world, all we’ll be left with is Google Drive: an infinite chasm of data objects that can only be navigated by direct link or sorting by recently updated.
Ok. Where does Docker come in? 🤔
This is a really good question, and I think it lies close to the heart of why so many data projects fail. If we can't define what a data product is, then what's our mission, anyway?
Zhamak's answer is a bit open-ended, but it's not wrong. Thinking through the difference between a data product and a typical application, two big things jump out at me:
1) Applications are much more deterministic. They are built to solve a specific need. Data products are much more opportunistic. They try to address emerging needs.
2) Applications are largely standalone and provide a guided experience. You're supposed to use them in a specific way. Data products are incomplete. Or rather, they are incomplete without the user. The user is PART of the real product.