The Modern Data Graph

Now that we're stacked, it's time to get... graphed?

Dec 21, 2022

This is Part 4 in a series titled: Knowledge isn’t Power.

Some people hate the modern labeling of the Modern Data Stack. They claim what’s labeled as modern is actually rehashed, what’s branded as innovative is actually irresponsible, what’s touted as fresh is actually a plot by the deep state to harvest our brains and hijack our laptops to mine crypto.

Keep the modern, I say. Fight the inexorable march of time. Believe that, this time, we’ve converged on a solution that will endure more than two years.

My beef is with the stack.

Like crushing eight weeks of P90X and pounding creatine shakes for breakfast, our stack obsession has transformed a bunch of shy nerds into preening meatheads. Geeks, once content with storytelling and role playing, now size each other up in the locker room. Bro, did you see Kevin’s warehouse? It’s stacked, bro, think he’s using Fivetran? With those OOTB dbt packages? Dude is a killer, bro.

The real jocks chuckle and shove them in lockers.

The vendor messaging is sleazy, unhealthy. No, I don’t want to double my stack size overnight, and my stakeholder love life is just fine, thank you. Yes, I’d like my stack to be reliable, but do I really need to keep it up all night long, every single night? How can that possibly be good for me?

When the business asks about the stack — or rather, doesn’t — our pecs stop bouncing. What was this all for again? The technical architecture of the data platform is largely a sideshow to the business. It’s too niche, it’s too monolithic. You can only fight about stacks between companies.

So, the stack, as a useful symbol for the way data supports the business, has lost its mojo. It is still useful as an almanac of Things You Might Want To Do, but the territory has been mapped, the monsters debunked or murdered. It was once virile: we have plenty of early-stage companies to show for it. Now, it’s limp.

So, what’s next?

I make only one assumption here: that there is a thing tended by data professionals which is bigger than individual data products. That this thing is important but hard to detect. A good metaphor ought to make this thing more concrete, less subtle.

The familiar analogies (water management, fossil fuels, warehouse distribution) are hopelessly clichéd. They are from a different era, when there was time for planning, when there was less abundance, less saturation.

Political and environmental metaphors (authority, democratization, exhaust) extend to society, but they are disconnected from the technical decisions that constrain what’s possible. The stack gets that right, at least: data work is technically rigorous.

Data mesh, on the other hand, breaks into new symbolic territory. Meshes are organic, uneven, and flexible — just like the business. As a graph, the mesh is mathematically interesting and aligned with the fundamental problem of data management. And with its principles of data products and interfaces, it unites both the technical assets and human agents.

In the mesh, data serves a purpose. It is an organized way of communicating between domains. Its substrate is interesting — a web of humans and systems with deliberate interfaces — and its purpose is broadly relevant. Actively managing this web is critical to the business.

But there’s a problem: data mesh is not a general-purpose metaphor, but a principled architectural pattern and ongoing culture war.

What I’d really like to do is take data mesh, peel its skin, drench it in butter, boil it for 1000 words or so, and then mash it. Then drench it in butter again. Maybe then, you and I could view its essence in a way that provides a useful illustration to data organizations small and large, new and old, homogenous and polyglot?

There is no better departure point than the modern data stack itself. And there is no better stack diagram, in my opinion, than the “unified data infrastructure” diagram from A16Z. It is thoughtful and versatile. It covers machine learning, business intelligence, and even operations flows. And it is covered in words, so many big words. Too many words.

Stack-thinking loves taxonomies: classification by domain, kingdom, phylum, class, order, family, genus, species. Stacks are modular: you have components that can be swapped out because they serve the same function. How you mix and match is the interesting part of the illustration.

In graphs, though, we don’t care. Not unless there’s a big group of something or a lot of repeated patterns. Each step is just that — one link in a larger chain.

Conceptually, we can call this a move from categorical concerns to topological concerns. We care about the texture of the system rather than the efficiency of any single part. (Even if that part makes us very happy.) “Only as strong as the weakest link” and all that.

To make this clearer, then, let’s start anonymizing these precious systems.

We’ve now added circles to each component of the stack, and we’ll soon strip them of their identity as well. Before that, though, we’ll need to bring in the edges. (Not all of them, though, because there’s a lot and I’m not getting paid for this.)

What does an edge represent here?

The obvious answer is data movement. Data replication and activation certainly have concrete transfers. But transformation of one model to another is not movement — it’s redefinition. Scheduling one thing to go before another is coordination. Validating payloads against schemas is purification. Understanding something new about the business after looking at a chart is movement of a sort, but not of bits and bytes.

In this graph, an edge represents knowledge transfer. It is one system imparting some bit of knowledge to another, in some way. Policies, processes, definitions are all knowledge that get baked into these systems. Initially, humans do all the baking, and then the systems bake it into each other. While there may be an exchange of data, it’s the intelligence folded into each relationship that matters.

A pair of motifs, or patterns, that were present but not prominent, emerge more clearly now:

While there is an overall left-to-right flow from data sources to output systems, there is a hub of interchange around the warehouse and storage. You could get rid of almost all other nodes and still have a platform that sustains the workloads.
There is a distinction between systems that operate within the graph (defined edges, inside boxes) and those that work on it (bottom, unboxed). The former tools move knowledge around. The latter are regulators: they track metadata, monitor processes, enforce security. (The orchestrator should also be included here.)
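If you want to poke at the first motif yourself, here is a minimal sketch in Python. The node names and edges are my own simplification for illustration, not a transcription of the A16Z diagram, and the regulators are left out because they work on the graph rather than within it. It counts the edges touching each node, which is a crude but serviceable way to spot a hub.

```python
from collections import Counter

# Hypothetical, heavily simplified edges: (source, target) means
# "source passes some knowledge to target". Not the real diagram.
EDGES = [
    ("sources", "ingestion"),
    ("ingestion", "warehouse"),
    ("warehouse", "transformation"),
    ("transformation", "warehouse"),
    ("warehouse", "bi"),
    ("warehouse", "ml"),
    ("warehouse", "activation"),
    ("activation", "sources"),
]


def degree(edges):
    """Count the edges touching each node, in either direction."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def hub(edges):
    """Return the node with the most edges (ties broken arbitrarily)."""
    return degree(edges).most_common(1)[0][0]


print(hub(EDGES))
```

In this toy version the warehouse touches six of the eight edges. Delete a leaf like `ml` and the rest still hangs together; delete the warehouse and most of the graph falls apart.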

Thinking about motifs, rather than modules, is valuable for data professionals because both the tooling and the businesses change rapidly. Could ETL be gone in a few years? Will a usable ML platform finally emerge? Can we replace the entire middle section with Dagster Cloud and DuckDB? When it comes to the stack, the only constant we can expect is change. But the patterns are likely to remain.

This is where the stack stops, but it’s not where we stop. There’s one more group of nodes that are, for a few more years at least, pretty important: people.

There you are. You and your sneaky colleagues are sitting at the top of the graph, in your proper place as both producers and consumers of knowledge.

Another motif: the systems on the left and right shield most employees from the knowledge distribution processes. These are mostly transactional systems (like Salesforce) and analytical systems (like Tableau), but they also include communications and SaaS tools critical for operations.

The data stack, it appears, is built underground.

Closing the human loop makes clear what stack analogies don’t: the role the data stack plays in an organization is transferring knowledge so humans can use it. The data graph supports the system graph, which supports the human graph.

Graphs differ from stacks in another way, too: they matter.

Every new hire is a node in your company’s organization graph. Every new SaaS tool purchased gets connected to that org graph. Eventually, those tools connect to each other. This graph illustrates business dependencies — and business vulnerabilities.

To change this graph — which is what creating (and using) a data platform does — is to change your organization. You cannot implement data mesh without redesigning the way your organization operates.

This is the central tenet of the data graph metaphor: how knowledge flows throughout a group of people determines the character of the group.

Let me sketch out this concept a bit more.

In a given organization there are (typically) four layers of graphs. All of these graphs are connected; we are separating them out as an illustration.

An external social layer (customers, partners, media). The business is trying to act on this layer, either to exert influence or collect revenue. But it might also be reacting to competitors and regulations.
An internal social layer (employees, domains). These include the internal social dynamics of the company, but also the rituals that bring people together. A high trust, low meeting culture creates a different social graph than the opposite.
A superstructure of systems (product surfaces, CRMs, reporting, communications). This is the systematically visible layer: the product, the CRM, the wiki, the ticketing system. Usually, this is where humans spend a disproportionate amount of time.
A substructure of integrative and distributional systems (data, infrastructure, policies, integrations). This includes the data platform, or at least, much of the logic that drives the output of the data platform. A Customer may be defined in a top-floor meeting, but the definition is operationalized in the basement.
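To make the layering concrete, here is a small sketch, again in Python. Every node, layer assignment, and edge below is invented for illustration; your organization will look different. Each node is tagged with one of the four layers, and the function picks out the edges that cross a layer boundary.

```python
# Hypothetical nodes, each tagged with one of the four layers.
LAYER = {
    "customer": "external social",
    "sales rep": "internal social",
    "analyst": "internal social",
    "crm": "superstructure",
    "dashboard": "superstructure",
    "warehouse": "substructure",
}

# Hypothetical knowledge-transfer edges between those nodes.
EDGES = [
    ("customer", "sales rep"),
    ("sales rep", "crm"),
    ("crm", "warehouse"),
    ("warehouse", "dashboard"),
    ("dashboard", "analyst"),
    ("analyst", "sales rep"),
]


def crossing_edges(edges, layer):
    """Return the edges whose two ends sit in different layers."""
    return [(a, b) for a, b in edges if layer[a] != layer[b]]


crossing = crossing_edges(EDGES, LAYER)
print(f"{len(crossing)} of {len(EDGES)} edges cross a layer boundary")
```

In this made-up example almost every edge crosses a boundary, which is the sense in which the layers are one connected graph that we only pull apart for illustration.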

Every organization has its own version of this structure. You could imagine slicing each of these by different domains as well, in a sort of snowflake pattern.

The high-level distinction is between society and structure. The social graph, obviously, is the important one: it brings in revenue, for one. Plenty of businesses have started with a structural graph the size of a napkin. The system graph exists to offload burden from the social graph.

The division between the internal and external social graphs is often fluid. However, the differences between sub- and superstructural graphs are stark.

The superstructure’s purpose is to operate, while the substructure’s purpose is to distribute. There is an element of frontend vs. backend here, but I don’t think that captures the distinction. The difference to me is more akin to private vs. public, where the public side is plumbing, electricity, internet, roads. All of these are critical to the health of the overall system, but only the privately owned buildings are occupied. Nevertheless, they are one graph.

The power of the model is its flexibility. Domain-driven design is popular now, but there’s no reason an organization has to be designed that way. You could have dictator-driven design: all information flows through a central group. You can have disorder-driven design: no systems are shared and the knowledge flows wherever it is willed.

The point is, in a graph world, you can draw these scenarios out and make hypotheses about them. You can connect a system to a human, or to a customer, trivially: is it one step away or five?
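The "one step away or five" question is a shortest-path question, and a breadth-first search answers it. Below is a minimal sketch in Python, assuming edges are undirected and every hop costs the same. The nodes and edges are hypothetical; swap in your own.

```python
from collections import deque

# Hypothetical org graph. Edges are treated as undirected.
EDGES = [
    ("customer", "product"),
    ("product", "event stream"),
    ("event stream", "warehouse"),
    ("warehouse", "dashboard"),
    ("dashboard", "analyst"),
    ("customer", "support agent"),
]


def neighbors(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(edges, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None."""
    graph = neighbors(edges)
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


print(hops(EDGES, "customer", "support agent"))
print(hops(EDGES, "customer", "analyst"))
```

In this toy graph the support agent is one hop from the customer and the analyst is five. Whether five is too many is the kind of hypothesis the drawing lets you argue about.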

I want to see fewer definitions and more playbooks. We need ways of sharing information about the problems that arise, and how you might solve them. Data contracts are a maneuver, something you can implement to reroute organizational workflows and achieve a desired result (reuse of a trusted dataset).

What makes data work so challenging is that it can be taught in isolation, but never exists as such. It is tightly coupled to the organizational graph. You can swap one company’s data graph for another’s about as easily as you could swap out the Tube in London for the T in Boston. The stacks are similar, but what matters are the stops.

To be fair to the stack, it has served the profession well.

Stack obsession has meaningfully improved both the tooling and practices. It’s now possible to move from one company to another, install a few dependencies, and contribute. Data work is more transparent, validated, and reusable than it was a few years ago.

The stack is also a powerful Schelling point for new entrants into the field. An infusion of data labor, coming from all backgrounds, benefitted from a common point of reference. An infusion of capital, coming from all backgrounds, devoured the stack’s boxes, boxes, thousands of labeled boxes.

I’m not saying you shouldn’t spend some time with Tony, or that Terraforming that six-pack is a bad thing. But just because tool selection is the first decision a team makes doesn’t mean it should dominate discourse.

Organizational complexity compounds with use cases, people, regulations, time — as well as tools. The stack is the tree trunk — critical, sure, but also the least interesting part of the tree.

And do you want to know a funny thing about trees?

They’re a type of graph.

Share this post to double your rich-club coefficient overnight!

Data People Etc.
