Storytelling with Data is one of those books I cherish, not because it unlocked a new level in my career or turned my world upside down, but because it perfectly captures an optimistic data vibe that takes me back to my start in the field.
I picked it up as an eager PhD student in an attempt to turn my posters into something that elicited more than a “hmm, interesting.” It was an idyllic time of life. I had all I needed: a terabyte of brain scans, a high-performance compute cluster, and both fig and ax from matplotlib. Data was beautiful, I believed, even if no one said that about my data.
Storytelling with Data marked the beginning of the end of that belief. Its premise is simple: if you’re not making a statement with your data visualization, you’re wasting ink. If you are making a statement, you can be more effective by following principles based on physics, psychology, and design theory.
The point is that data ought to have a point. “Charts” are dead things, a by-product of our time; they clutter. “Stories” are alive and perennial. They will never go away.
My problem was I didn’t want to tell stories. I had fallen in love with the methods. I wanted to process the data, tweak parameters, parallelize the jobs, modularize code, and benchmark performance. I have a Morlock brain, but science is an Eloi career. In the intervening years, I’ve tunneled further away from storytelling, descending into the sewers and rancid places where data is not beautiful but wretched.
A question has nagged at me, though. If storytelling is the telos of analytics, that essential human capability we enhance through technique and information, what is the corresponding telos of data engineering?
What’s it all for—the ETLs, medallion architectures, clustering keys, event streams, data contracts, entity resolvers, data meshes, and metadata sinks?
If a team needs an application database stabilized, they turn to infrastructure engineers; if a founder wants to build a whole new flavor of database, they’ll invest in software engineers. If a business needs to create a process around a data system, it’ll hire an operations person; if it needs insights, it’ll hire an analyst. Only when the business wants to hoard data with consistency, reliability, and scale does the data engineer get the call. Why?
Accumulating data is the first step towards generalized intelligence—in this case, business intelligence. However, this is not the only type of intelligence: the surveillance state creates strategic intelligence, the scientific process yields scientific consensus, and a crystallized Internet backs large language models. In all cases, a data-generating process creates a dataset that one or more actors then operate on. Data constrains the questions that can be asked, the stories that can be told, and the worldviews that can be adopted.
This systematic collection and ordering of data is worldbuilding.
If storytelling is the text, then worldbuilding is the context. Many stories can be told about a single world, and all of them are better when that world is vibrant, descriptive, and complete. A fragmented and disjointed world creates tension; the audience never knows what might happen next and becomes skeptical.
What’s new in the last ten years is that the “real world” has become less and less the default world that people live in. The phrase “single source of truth” was coined in the 1980s but wasn’t popular until the Internet era. As people adopt data systems that model their local concerns with high fidelity—sales, finance, marketing, engineering, and operations—it takes more effort to stitch them together into a coherent world. Truth is a corporate concern.
When little worlds get connected at scale, confusing emergent properties appear. Namespaces are limited, forcing collisions. (“What’s the definition of a user?”) Concepts evolve, creating compatibility issues. (“Which version is this?”) Cost and performance oppose comprehensiveness and fidelity: you can’t see everything, everywhere, all at once. And, of course, there’s timestamps.
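To make the collisions concrete, here is a minimal sketch in Python of stitching two little worlds together. All of the names and fields (crm_user, app_user, signup_ts, and so on) are hypothetical, invented for illustration: two systems that each call something a “user,” carry different schema versions, and record the same moment in different timestamp conventions.

from datetime import datetime, timezone

# Two "little worlds," each with its own notion of a user and its own clock.
# All field names here are hypothetical, not from any particular system.
crm_user = {"user_id": "C-1042", "email": "ada@example.com",
            "created": "2024-03-01T09:30:00-05:00",   # local time, ISO 8601
            "schema_version": 2}
app_user = {"uid": 98311, "email_address": "ada@example.com",
            "signup_ts": 1709303400,                  # Unix epoch seconds, UTC
            "schema_version": 1}

def to_utc(value):
    """Normalize both timestamp conventions to timezone-aware UTC."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return datetime.fromisoformat(value).astimezone(timezone.utc)

def resolve_user(crm, app):
    """Stitch two source records into one canonical 'user' via shared email."""
    assert crm["email"].lower() == app["email_address"].lower(), "not the same person"
    return {
        # Namespace the keys so "user_id" and "uid" can't silently collide.
        "canonical_id": f"crm:{crm['user_id']}|app:{app['uid']}",
        "email": crm["email"].lower(),
        # The world's "creation time": earliest sighting across sources.
        "first_seen_utc": min(to_utc(crm["created"]), to_utc(app["signup_ts"])),
        # Keep the version of each little world's schema for later reconciliation.
        "source_versions": {"crm": crm["schema_version"], "app": app["schema_version"]},
    }

print(resolve_user(crm_user, app_user))

In this toy case the two timestamps turn out to be the same instant expressed two ways; in practice, deciding which record, clock, and version wins is exactly the kind of worldbuilding judgment the rest of this piece is about.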
A key objective in data engineering is to facilitate “real-world thinking” despite “artificial world physics.” Data is the lens through which some agents view the world. The difference between brilliance (high correspondence) and dullness (low correspondence) is a difference in data. Making a virtual world correspond to—and then feel like—the real world takes enormous effort.

When data systems are insufficiently clear (or when the user is insufficiently trained), users assume they are dealing with real-world properties even if operating in an artificial world. This means the wrong decision is made, whether by a human or automation. Increasingly, that error could mean anything—a loan denied, an application rejected, a turn not made, or a person arrested.
Worldbuilding sounds fantastical, as if it is incomplete without a magic system and mythology. That’s fair. Search for worldbuilding books, and you will find more from Dungeons and Dragons than O’Reilly. But it’s not all that different—does a ten-point magic system look all that different from a metric tree?—and it’s literally true that the software systems that guide our moment-to-moment experience are artifacts of some person’s imagination.
The storyteller wants to smooth out all this artificial complexity; it’s just context. To the worldbuilder, though, the rough edges are the text. Post-Internet, we live in little virtual worlds, and respecting underlying principles is critical to making them habitable. They ought to be legible and consistent, and they ought to make their assumptions explicit. The accounting should check out. That takes effort, and sometimes, it takes engineering. Lacking it, they won’t endure.
Like storytelling, this effort did not start with data. It’s not an artifact of Hadoop, microservices, or the relational database. It has a lineage stretching back millennia, from cartographers and economists to political scientists and myth-makers. And, of course, to more recent innovations: the printing press, modern engineering standards, and information theory.
Data engineering is simply a new venue for worldbuilders to exercise their passion. Given its job demand and salaries, I can’t say it’s underappreciated. Still, most people don’t realize just how transformational setting up a world’s data grid is to the health and happiness of its residents or how profound a philosophical statement a tidy spreadsheet is.
I want more people to realize it!
For 2025, I plan to channel more of my writing energy towards this idea.
What makes a world? How do worlds start? How do worlds end?
Who builds worlds? For what reasons—beneficent and nefarious?
What principles of physics, psychology, and language underpin compelling worlds?
When should you build a world? How much do they cost? What are the alternatives?
The modern data engineer is not the only person doing worldbuilding. Unlike others, though, data engineers disproportionately deal with large, multi-source worlds—and are on the hook to ensure they don’t break. Data management in the enterprise—or any interesting business—is one of the most complex, gnarly, realistic worldbuilding problems a person can face.
In other words, while many build worlds, data engineers build them at scale.