2025-06-26
DuckLake is an integrated data lake and catalog format. See the DuckLake site and the DuckDB announcement, “DuckLake: SQL as a Lakehouse Format”; also discussion on HN.
Their own description positions the project as an open-source version of the pattern that BigQuery and Snowflake use for their data lake capabilities, and as a reaction to other lakehouse formats like Iceberg and Delta Lake. I haven’t paid enough attention to this area in recent years to evaluate those claims.
Some of the discussion on HN, like this, gets at my questions and confusions. I think the design is such that DuckLake is a replacement (?) for distributed compute engines like Spark? It seems like all updates to the data sets must go through the catalog database: yes, the data is stored as Parquet in object storage, but reading or modifying it still has to go through the “catalog” database. Why is it called a “catalog”, then, instead of a “query engine” or similar? I feel like I’m missing something here.
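For concreteness, here’s roughly what the flow looks like in DuckDB, pieced together from the announcement’s examples; the file name, alias, table, and DATA_PATH below are placeholders, and I haven’t verified this end to end:

```sql
-- A minimal sketch based on the DuckDB announcement; names and paths are
-- placeholder values.
INSTALL ducklake;

-- The attach string names the catalog database (here a local DuckDB file);
-- DATA_PATH is where the Parquet data files actually land.
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;

-- Writes add Parquet files under DATA_PATH and record them, transactionally,
-- in the catalog's metadata tables.
CREATE TABLE demo (i INTEGER);
INSERT INTO demo VALUES (42), (43);

-- Reads also consult the catalog first, to learn which Parquet files to scan.
SELECT * FROM demo;
```

Apparently the point of making the catalog an ordinary SQL database is that you can swap the attach string for a PostgreSQL or MySQL connection and get a shared, multi-writer catalog, which is the piece that Iceberg-style formats bolt on as a separate service.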
I’m also getting the impression that distributed query engines like Spark are falling out of favor. The HN comments make it sound like most people are just taking the “use one giant node” approach, which, fair, is actually a great approach. I guess Spark will remain relevant, but maybe only at the front edge of a data problem, or for truly enormous datasets.