Data's final format

Iceberg™ may have won the battle, but it will never win the war.

Dec 09, 2024

With the release of Amazon’s S3 Tables, it seems that Iceberg™ has won the “format wars” of the last few years. It’s a big moment. Data professionals everywhere can finally manage data with confidence, trusting that their employer’s massive information stores can be created, read, updated, deleted…

Wait. Hmm.

So, I know that Iceberg™ is the coolest way to manage a data lake—scalability, time travel, dynamic partitioning, openness. I have wanted to try it out, though there were always flashier toys to play with first. Of all the hype waves in the past few years, open table formats have been the least, well… hype-inducing.

However, given the recent news, I decided to cordon off an hour to stand up a few Iceberg™ tables—to get a glimpse of the future of table management.

Initial Google results made it clear I was in for a journey. Dremio promoted a “10-part web series” on mastering Iceberg™ alongside entire courses run by Starburst, Databricks, Udemy, and DataCamp. The top Medium article ominously labeled itself “an introspection,” suggesting I was about to second-guess my choices. Next followed a YouTube video from Dremio, a Reddit post from Dremio, an actual tutorial from the Apache project, and Dremio’s actual docs site.

Finally, I found a few tutorials from Snowflake, Iceberg™’s chief ally, that fit my needs well. Here, too, I had options: should I integrate Iceberg™ with AWS Glue, Microsoft OneLake, Coalesce.io, Snowflake Cortex, or Spark? Apparently, Iceberg™ is best served with a little something on the side.

I spent the rest of my hour fumbling with principal roles, external volumes, updating, network policies, and catalog objects. If I strayed from the path, I hit errors: “variant columns are not supported” or “column definitions are only available with the Snowflake catalog.” By the time I reached the climactic “select * from my_iceberg_table,” my hype had frozen in my veins.

I can report, however, that yes, Iceberg™ tables work. They may be less feature-rich than the cloud provider’s native formats; they may require additional work to set up; and the effort may catalyze an introspective blog post. But as far as table formats go, Apache Iceberg™ is as exciting as they come.

In other words: it’s not that exciting. At least, not in the way we’ve been trained to expect from technological innovation.

There is little magic in the Iceberg™ experience at the user/operator level. The magic is supposed to happen at the ecosystem level, across vendors and projects and workflows. Codeium’s Windsurf will build an application for you in seconds; Iceberg™ will let you and dozens of others build a sprawling data labyrinth in years.

Table formats do not lend themselves to demos, anyway. Iceberg™ tutorials are like putting different objects on a kitchen table. “See how efficiently it supports these flowers? How it holds these plates at the same time? Observe—these chairs can be reserved for specific people.” The memorable part is not the demonstration but dragging it through the back door.

I imagine the race to build a transcontinental railroad looked similar at a technological level. The effort was transformative and required standardization for railroad ties, tracks, nails, spacing, and the like, but these were mere implementation details. Teams solved complicated engineering problems—bridges, tunnels, and junctions. The real change, though, was that of scale: it was a massive effort that bent the land to the economy's will, making the continent fit for an ever-growing number of machines to traverse it efficiently and independently.

You can’t adequately demo that type of project. Reducing its size for a demo fundamentally changes its nature. It becomes a toy rather than a glimpse of the thing itself.

So, too, with Iceberg™. Iceberg™ is better understood as a response to a socioeconomic problem than as a response to a technological one. From its creators: “shared storage disrupts the monolithic business model that has dominated the database industry since its inception.”

The immediate pressure may be the business model, but underneath that is a desire to make the Internet more traversable by engines. The more open the data, the easier it is to capitalize on it. If the landscape is not appropriately paved, future agentic swarm intelligences may have to invest to spin off their own agentic data teams to curate the data themselves.

The “format wars” may be remembered as a skirmish between Databricks and Snowflake, but the grandchildren will not remember that any more than we remember the animosity between the Union Pacific and Central Pacific Railroads. Instead, they’ll inherit the legacy of the Internet refining itself in its hunger. Iceberg™ may be configured as a product, pitched as a community, sold as a business, packaged in content, but it’s fundamentally a world-harnessing effort to support even more compute and storage.

So—I want to say a word of thanks for a wretched enemy and faithful friend: the CSV table format.

Unlike Iceberg™, which is so enmeshed in our momentary cloud milieu that understanding it requires a ten-lesson course, the CSV is documented primarily by its name: comma-separated values. (Even then, it’s gentlemanly enough to support any delimiter its guests desire.)

The CSV format is humble. Its strategy for metadata management is column names. It tackles data sharing and role management through the copy/paste protocol. It’s unaware of context, the user, or cloud providers. It leaves it to the user to follow instructions; it leaves it to the reader to fix the user’s errors and inconsistencies. CSV has no AI strategy.
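For the skeptical, here is roughly what that humility looks like in practice. This is a minimal sketch using Python's standard `csv` module; the sample data and the `read_table` helper are made up for illustration and appear nowhere else in this post. It shows the two claims above: the header row is all the metadata you get, and the delimiter is whatever you say it is.

```python
import csv
import io

# Hypothetical sample data: the same table written with two different delimiters.
comma_text = "id,name\n1,Ada\n2,Grace\n"
pipe_text = "id|name\n1|Ada\n2|Grace\n"


def read_table(text, delimiter=","):
    """Parse delimited text and return (column names, rows as dicts).

    The header row is the only metadata CSV carries, so every value
    comes back as a plain string; types are left to the reader.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


columns, rows = read_table(comma_text)
pipe_columns, pipe_rows = read_table(pipe_text, delimiter="|")

print(columns)            # the column names taken from the header row
print(rows == pipe_rows)  # the delimiter changes nothing about the parsed rows
```

Note that `id` comes back as the string `"1"`, not the integer `1`. Fixing that is, as promised, left to the reader.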

Dremio doesn’t buy ads for CSV tutorials (although they admit that “CSVs play an integral role in the ingestion, storage, and processing of data within a data lake infrastructure”). CSV doesn’t need advocates or pumpers; it doesn’t need to wage a format war. It’s already won.

I admire CSV because it is a human solution to a machine problem. It is one of our first attempts at human-to-machine communication. It was an innovation over punch cards, arising organically in multiple places in the 1960s. By the 1970s, it was accepted as a standard—even though no formal standards existed until 2005.

CSV is a convergent idea: it’s the organic way that the human animal thinks about encoding information into a table. It satisfies the eye’s need to use space efficiently with tables and rows and the temporal lobe’s love for serial sequencing. Using the comma is no accident: it’s simple, concise, natural.

If Iceberg™ is the transcontinental railroad—an economy-level consensus-building effort to harness the world’s data for more efficient processing—then CSV is a trail hacked by machete. It’s a frontier data format, a base state all data processing systems fall back to on failure, like `cron` for time or `touch` for feel. There are no trademarks in the places where CSV thrives.

There’s much talk about Excel’s longevity as it relates to the current crop of analytics tools, and I’m on board—the common people use Excel, and the common people will be around long after the specialists die out. But even spreadsheets are one step higher on an evolutionary program that selects the fittest data interfaces. As long as there’s value in data, it’ll be separable by commas.

id,call_to_action\n1,thanks for the support!

Data People Etc.
